I have this text:
'''Hi, Mr. Sam D. Richards lives here, 44 West 22nd Street, New York, NY 12345. Can you contact him now? If you need any help, call me on 12345678'''
How can the address part be extracted from the above text using NLTK? I have tried the Stanford NER Tagger, which gives me only New York as a location. How can I solve this?
Recommended answer
Definitely regular expressions :)
Something like:
import re

txt = ...
regexp = "[0-9]{1,3} .+, .+, [A-Z]{2} [0-9]{5}"
address = re.findall(regexp, txt)
# address = ['44 West 22nd Street, New York, NY 12345']

Explanation:
[0-9]{1,3}: 1 to 3 digits, the address number
(space): a space between the number and the street name
.+: the street name, one or more of any character
(comma and space): the separator before the city
.+: the city, one or more of any character
(comma and space): the separator before the state
[A-Z]{2}: exactly 2 uppercase letters from A to Z, the state
(space): a space between the state and the ZIP code
[0-9]{5}: exactly 5 digits, the ZIP code
re.findall(expr, string) returns a list containing every match found.
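Because `.+` is greedy and also matches commas, the pattern above can swallow more text than intended when a line contains several comma-separated phrases. The following is only a sketch of one possible tightening, not part of the original answer: it assumes US-style addresses in the same "number street, city, ST 12345" layout as the sample, and the names `ADDRESS_RE` and `extract_addresses` are made up for illustration. It swaps `.+` for `[^,]+` so each field stops at the next comma, and uses named groups to expose the individual parts.

```python
import re

# Sample text from the question.
TXT = (
    "Hi, Mr. Sam D. Richards lives here, 44 West 22nd Street, New York, "
    "NY 12345. Can you contact him now? If you need any help, call me on 12345678"
)

# Hypothetical stricter variant of the answer's pattern (an assumption, not
# the original author's regex): [^,]+ keeps each field from crossing a comma,
# and \b stops the ZIP code from matching the start of a longer number.
ADDRESS_RE = re.compile(
    r"(?P<number>[0-9]{1,3}) (?P<street>[^,]+), (?P<city>[^,]+), "
    r"(?P<state>[A-Z]{2}) (?P<zip>[0-9]{5})\b"
)


def extract_addresses(text):
    """Return one dict of address parts for each match found in text."""
    return [m.groupdict() for m in ADDRESS_RE.finditer(text)]


addresses = extract_addresses(TXT)
for parts in addresses:
    print(parts)
```

`re.finditer` is used here instead of `re.findall` because, once a pattern contains groups, `findall` returns tuples of the group contents rather than the whole matched string, and match objects make the named parts easier to read. Addresses in other layouts (no state abbreviation, apartment numbers, street numbers longer than three digits) would still need a different pattern or a dedicated address parser.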