
Splitting an address into house number, street, city, and state in Python

I have 1 billion addresses in a messy format, for example:

'12-as FS street, 456 DLGG Area, Rand. District, Sydney, Australia 32 1020203'

I need output like this:

Column1: 12AS
Column2: FS 456 DLGG Area
Column3: Rand
Column4: Sydney
Column5: Australia
Column6: 32
Column7: 1020203

So basically I need each address split into house number, address line, state, country, state code, and pincode, with words like street, district, countryside, road, etc. removed.

I also need to find the most frequent words, i.e. those that occur more often than a given threshold.

You need to write a parser, and its code will depend on your data, unless somebody has already written a parser for your specific data format.

Some immediate questions (the list is incomplete):
1) Is a comma the separator in every line?
2) Are commas used inside values (e.g. inside a street name)?
3) What is the full list of words to remove (road, rd., blvd., etc.)?
4) Can an address use a "house name" instead of a street with a number?
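
To make the questions above concrete, here is a minimal sketch of such a parser. It assumes that a comma always separates the parts and never appears inside a value, that the last part is always "country, state code, pincode", and that the first token is the house number. The names parse_address, strip_stop_words, and STOP_WORDS are made up for this illustration, and the stop-word list only contains the words mentioned in the question. Real data will almost certainly need more rules.

```python
import re

# Hypothetical stop-word list taken from the question; extend it for real data.
STOP_WORDS = {"street", "district", "countryside", "road"}


def strip_stop_words(text):
    """Drop stop words and surrounding dots ("Rand. District" -> "Rand")."""
    words = [w.strip(".") for w in text.split()]
    return " ".join(w for w in words if w and w.lower() not in STOP_WORDS)


def parse_address(line):
    """Split one address into seven columns (assumed layout, see above)."""
    parts = [p.strip() for p in line.split(",")]
    if len(parts) < 4:
        raise ValueError(f"too few comma-separated parts: {line!r}")

    # Assumed layout of the last part: "<country> <state code> <pincode>".
    match = re.fullmatch(r"(.+?)\s+(\S+)\s+(\d+)", parts[-1])
    if match is None:
        raise ValueError(f"unexpected last part: {parts[-1]!r}")
    country, state_code, pincode = match.groups()

    # Assumed: the first token of the first part is the house number.
    house, _, rest = parts[0].partition(" ")
    house_number = re.sub(r"[^0-9A-Za-z]", "", house).upper()

    # Everything between the house number and the last three parts
    # is treated as the address line.
    address_line = strip_stop_words(" ".join([rest] + parts[1:-3]))

    return [
        house_number,
        address_line,
        strip_stop_words(parts[-3]),
        strip_stop_words(parts[-2]),
        country,
        state_code,
        pincode,
    ]


sample = ("12-as FS street, 456 DLGG Area, Rand. District, "
          "Sydney, Australia 32 1020203")
print(parse_address(sample))
```

Lines that do not fit the assumed layout raise ValueError, so with a billion rows you would probably catch that error and write the offending lines to a separate file for inspection.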

One example of an address parser with some learning functionality is usaddress: https://github.com/datamade/usaddress

If your format and requirements do not exactly match an existing parser, you will have to write your own.
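
For the second part of the question (the most frequent words above a threshold), a rough sketch with collections.Counter could look like this. The function name frequent_words is made up, "word" is assumed to mean a run of ASCII letters compared case-insensitively, and "above" is read as strictly greater than the threshold.

```python
import re
from collections import Counter


def frequent_words(lines, threshold):
    """Return {word: count} for words occurring more than `threshold` times."""
    counts = Counter()
    for line in lines:  # works with any iterable, e.g. an open file
        # Assumption: a word is a run of ASCII letters, lower-cased.
        counts.update(w.lower() for w in re.findall(r"[A-Za-z]+", line))
    return {word: n for word, n in counts.items() if n > threshold}


addresses = [
    "12 FS street, Sydney",
    "4 High Street, Sydney",
    "9 Low road, Perth",
]
print(frequent_words(addresses, 1))
```

Because the function reads one line at a time, only the counter itself has to fit in memory, which matters at the scale of a billion addresses. The result is also a reasonable starting point for building the list of words to remove.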

