
How to extract the address data from a string in Python

I am trying to extract the relevant address info from a string and discard the garbage. So this:

al domicilio social de la Compañía, Avenida de Burgos, 109 - 28050 Madrid (Indicar Asesoría Jurídica – Protección de Datos) 

Should be:

Avenida de Burgos, 109 - 28050 Madrid

What I've done:

I am using Stanza NER to find locations in the text.

After that, I use the character indexes of the found entities to get the full address. For example, if Madrid (a Spanish city) is found at text[120:128], I extract text[60:143] (60 characters before the entity and 15 after it) to get the full address.

My current code is:

# Stanza NER for locations
# Install first (in a notebook: !pip install stanza)
import stanza

# Download the Spanish model
stanza.download('es')

# Create the NER tagger
nlp = stanza.Pipeline(lang='es', processors='tokenize,ner')
text = 'al domicilio social de la Compañía, Avenida de Burgos, 109 - 28050 Madrid (Indicar Asesoría Jurídica – Protección de Datos) '
doc = nlp(text)

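# Print the LOC entities, then a window of 60 characters before and 15 after each one.
# max(0, ...) keeps the start index from going negative when the entity is near the start.
locations = [ent for ent in doc.ents if ent.type == "LOC"]
print(*locations, sep='\n')
print(*[text[max(0, ent.start_char - 60):ent.end_char + 15] for ent in locations], sep='\n')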

After this, in this particular case (which should be reproducible), I get the following address:

cilio social de la Compañía, Avenida de Burgos, 109 - 28050 Madrid (Indicar Aseso

This contains extra "garbage": "cilio social de la Compañía," at the start and "(Indicar Aseso" at the end.
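The fixed-width window is what lets the garbage in. A minimal sketch of that windowing step, without Stanza, makes the behaviour easy to reproduce. The helper name context_window is hypothetical, and text.find stands in for the entity offsets that the NER tagger would return:

```python
def context_window(text, start, end, before=60, after=15):
    """Return the text around the span [start, end), clamped to the string bounds."""
    lo = max(0, start - before)        # a negative start would wrap around in a slice
    hi = min(len(text), end + after)
    return text[lo:hi]


text = ('al domicilio social de la Compañía, Avenida de Burgos, 109 - 28050 Madrid '
        '(Indicar Asesoría Jurídica – Protección de Datos) ')

start = text.find('Madrid')            # stand-in for ent.start_char
end = start + len('Madrid')            # stand-in for ent.end_char
print(context_window(text, start, end))
```

Because the window size is fixed, any address shorter than 60 characters drags in the preceding text, which is why a second cleaning step is needed.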

In the next part of the process, I use the libpostal library to parse the address as follows:

# Install first (in a notebook: !pip install postal); this needs the libpostal C library
from postal.parser import parse_address
parse_address('Avenida de Burgos, 109 - 28050 Madrid')

This works reliably in most cases, but only with clean addresses. For the clean address above it returns:

[('avenida de burgos', 'road'),
 ('109', 'house_number'),
 ('28050', 'postcode'),
 ('madrid', 'city')]
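One possible follow-up, sketched here under assumptions, is to filter the (value, label) pairs by label and keep only the address-like ones. The helper name keep_address_parts and the set of labels are hypothetical choices, and the noisy input below is made up for illustration rather than produced by running libpostal:

```python
# Labels treated as part of a postal address (an assumption; adjust as needed)
ADDRESS_LABELS = {"road", "house_number", "postcode", "city"}


def keep_address_parts(parsed, labels=ADDRESS_LABELS):
    """Keep only the (value, label) pairs whose label is in `labels`."""
    return [(value, label) for value, label in parsed if label in labels]


# Hypothetical noisy parse: the first pair is invented to show the filtering step
noisy = [
    ('cilio social de la compañía', 'house'),
    ('avenida de burgos', 'road'),
    ('109', 'house_number'),
    ('28050', 'postcode'),
    ('madrid', 'city'),
]

clean = keep_address_parts(noisy)
print(clean)
```

Whether this helps depends on how libpostal labels the surrounding text in practice, so it is worth checking on real samples first.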

So, to sum up, I am looking for a technique other than regex to help me discard the garbage around addresses (an existing library, if there is one, or a new NLP approach). Thanks.

For US address extraction from bulk text:

For US addresses in bulk text I have had pretty good luck, though not perfect, with the regex below. It won't work on many unusual address formats, and it only captures the first 5 digits of the ZIP code.

Explanation:

  • ([0-9]{1,6}) - a string of 1-6 digits to start off.
  • (.{5,75}?) - any character, 5-75 times, matched lazily. I looked at the addresses I was interested in, and the vast majority were over 5 and under 60 characters for address line 1, address line 2 and city.
  • (BIG LIST OF AMERICAN STATES AND ABBREVIATIONS) - matches the state. Assumes state names are in Title Case.
  • .{1,2} - accommodates the separators between the state and the ZIP code, such as ",\s" or just "\s".
  • ([0-9]{5}) - captures the first 5 digits of the ZIP code.

import re

text = "is an individual maintaining a residence at 175 Fox Meadow, Orchard Park, NY 14127. 2. other,"

# (?<!Baja ) keeps "Baja California" from matching as the US state
address_regex = r"([0-9]{1,6})(.{5,75}?)((?:Ala(?:(?:bam|sk)a)|American Samoa|Arizona|Arkansas|(?<!Baja )California|Colorado|Connecticut|Delaware|District of Columbia|Florida|Georgia|Guam|Hawaii|Idaho|Illinois|Indiana|Iowa|Kansas|Kentucky|Louisiana|Maine|Maryland|Massachusetts|Michigan|Minnesota|Miss(?:(?:issipp|our)i)|Montana|Nebraska|Nevada|New (?:Hampshire|Jersey|Mexico|York)|North (?:(?:Carolin|Dakot)a)|Ohio|Oklahoma|Oregon|Pennsylvania|Puerto Rico|Rhode Island|South (?:(?:Carolin|Dakot)a)|Tennessee|Texas|Utah|Vermont|Virgin(?:ia| Island(s?))|Washington|West Virginia|Wisconsin|Wyoming|A[KLRSZ]|C[AOT]|D[CE]|FL|G[AU]|HI|I[ADLN]|K[SY]|LA|M[ADEINOST]|N[CDEHJMVY]|O[HKR]|P[AR]|RI|S[CD]|T[NX]|UT|V[AIT]|W[AIVY])).{1,2}([0-9]{5})"

addresses = re.findall(address_regex, text)

addresses is then: [('175', ' Fox Meadow, Orchard Park, ', 'NY', '', '14127')]

You can combine these parts and collapse the extra spaces like so:

for address in addresses:
    out_address = " ".join(address)
    out_address = " ".join(out_address.split())
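The same five-part structure can also be written with named groups, which makes the pieces easier to pick out than positional tuples. This is only a sketch: STATES is a deliberately shortened, hypothetical list, and the group names are my own, not part of the regex above.

```python
import re

# Shortened, hypothetical state list; a real pattern would list every state
STATES = r"New York|New Jersey|California|Texas|NY|NJ|CA|TX"

ADDRESS_RE = re.compile(
    r"(?P<number>[0-9]{1,6})"          # 1-6 digits that start the address
    r"(?P<middle>.{5,75}?)"            # street, line 2 and city (lazy)
    r"(?P<state>" + STATES + r")"      # state name or abbreviation
    r".{1,2}"                          # separator between state and ZIP
    r"(?P<zip>[0-9]{5})"               # first 5 digits of the ZIP code
)

text = "is an individual maintaining a residence at 175 Fox Meadow, Orchard Park, NY 14127. 2. other,"

match = ADDRESS_RE.search(text)
if match:
    parts = match.groupdict()
    # Join the parts and collapse repeated whitespace
    out_address = " ".join(" ".join(parts.values()).split())
    print(parts["state"], parts["zip"])
    print(out_address)
```

Named groups also avoid the stray empty string that the extra capturing group in the full state list adds to each tuple.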

To then break this into a proper line 1, line 2, etc., I suggest using an address validation API such as Google or Lob. These can take a string and break it into parts. There are also some Python solutions for this, such as usaddress.

For outside the US

To port this to other locations, consider how it could be adapted to your region. If you have a finite number of states and a similar structure, try looking for someone who has already built the "state" regex for your country.
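As a rough illustration of such a port, here is a sketch for Spanish-style addresses like the one in the question. It anchors on a street-type keyword instead of a state list and on the 5-digit postcode. The STREET_TYPES list is hypothetical and incomplete, and the city is assumed to be a single capitalised word, so multi-word cities would be cut short:

```python
import re

# Hypothetical, incomplete list of Spanish street types
STREET_TYPES = r"Avenida|Calle|Plaza|Paseo|Camino|Carretera|Ronda"

ES_ADDRESS_RE = re.compile(
    r"\b(?:" + STREET_TYPES + r")\s"   # street type starts the address
    r".{3,75}?"                        # street name and number (lazy)
    r"\b[0-9]{5}\b"                    # 5-digit postcode
    r"\s+[A-ZÁÉÍÓÚÑ]\w+"               # city: one capitalised word (an assumption)
)

text = ('al domicilio social de la Compañía, Avenida de Burgos, 109 - 28050 Madrid '
        '(Indicar Asesoría Jurídica – Protección de Datos) ')

found = ES_ADDRESS_RE.search(text)
if found:
    print(found.group(0))
```

Like the US pattern, this trades recall for simplicity: addresses that do not start with a listed street type are missed.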
