How to extract an address from raw text using NLTK in Python?

Updated: 2024-10-27 21:10:41

Question

I have this text:

'''Hi, Mr. Sam D. Richards lives here, 44 West 22nd Street, New York, NY 12345. Can you contact him now? If you need any help, call me on 12345678'''

How can the address part be extracted from the above text using NLTK? I have tried the Stanford NER Tagger, which gives me only New York as a location. How can I solve this?

Recommended answer

Definitely regular expressions :)

Something like:

import re

txt = ...
regexp = "[0-9]{1,3} .+, .+, [A-Z]{2} [0-9]{5}"
address = re.findall(regexp, txt)
# address = ['44 West 22nd Street, New York, NY 12345']
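For reference, here is a minimal self-contained sketch of the same idea, with the question's sample text filled in for txt. The names ADDRESS_RE and extract_addresses are illustrative only and do not come from the original answer. The pattern is unchanged, just written as a raw string and compiled once.

```python
import re

# Sample text copied from the question.
txt = ("Hi, Mr. Sam D. Richards lives here, 44 West 22nd Street, "
       "New York, NY 12345. Can you contact him now? "
       "If you need any help, call me on 12345678")

# Same pattern as in the answer (hypothetical constant name).
ADDRESS_RE = re.compile(r"[0-9]{1,3} .+, .+, [A-Z]{2} [0-9]{5}")


def extract_addresses(text):
    """Return every substring of text that matches ADDRESS_RE."""
    return ADDRESS_RE.findall(text)


addresses = extract_addresses(txt)
print(addresses)
```

Note that the pattern only fits addresses shaped like the example: a 1 to 3 digit house number, a two-letter state code, and a 5-digit ZIP code. Because .+ is greedy, a line containing several addresses may be returned as one long match.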

Explanation:

[0-9]{1,3}: 1 to 3 digits, the address number

(space): a space between the number and the street name

.+: street name, any character for any number of occurrences

, : a comma and a space before the city

.+: city, any character for any number of occurrences

, : a comma and a space before the state