
Data cleaning in a CSV file using Python

Hi, I am working on a CSV file with several columns. One particular column holds an address in the following format:

10515, 115th Place Northeast, Juanita, Kirkland, King County, Washington, 98033, United States of America

I want to split each address on the commas and create a relevant new column for each part, such as Unit, Street, State, Post code and so on.

I was able to split the addresses on the commas, and now I have one column for each part.

The problem is that the data is not consistent. The split gives me 10 columns in total, but the parts are not always in the same order. Some records look like the following:

3008, 38th Avenue Southwest, West Seattle, Seattle, King County, Washington, 98126, United States of America

23098, Northeast 130th Street, Trilogy, Union Hill-Novelty Hill, Novelty, King County, Washington, 98053, United States of America

Fire Station 34, 633, 32nd Avenue East, Broadmoor, Washington Park, Seattle, King County, Washington, 98112, United States of America

Basically, not every record has all 10 kinds of information, and the parts are not necessarily in the same order.

What would be the best approach to cleaning this type of data? Eventually I want each value placed in the column for what it represents: a city goes under the city column, a postcode under the postcode column, and so on.

I am using Python 2.0.

Hoping to get a good solution. Thanks!

I would use the library usaddress to decompose an address into its constituent parts.

https://usaddress.readthedocs.io/en/latest/

>>> import usaddress
>>> usaddress.tag('Robie House, 5757 South Woodlawn Avenue, Chicago, IL 60637')
(OrderedDict([
   ('BuildingName', 'Robie House'),
   ('AddressNumber', '5757'),
   ('StreetNamePreDirectional', 'South'),
   ('StreetName', 'Woodlawn'),
   ('StreetNamePostType', 'Avenue'),
   ('PlaceName', 'Chicago'),
   ('StateName', 'IL'),
   ('ZipCode', '60637')]),
'Street Address')

>>> usaddress.tag('State & Lake, Chicago')
(OrderedDict([
   ('StreetName', 'State'),
   ('IntersectionSeparator', '&'),
   ('SecondStreetName', 'Lake'),
   ('PlaceName', 'Chicago')]),
'Intersection')

>>> usaddress.tag('P.O. Box 123, Chicago, IL')
(OrderedDict([
   ('USPSBoxType', 'P.O. Box'),
   ('USPSBoxID', '123'),
   ('PlaceName', 'Chicago'),
   ('StateName', 'IL')]),
'PO Box')

From there, you can query the returned dictionary (the first element of the tuple that usaddress.tag returns) and readily put its values into your pandas DataFrame.
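As a rough sketch of that last step, something like the following could turn a list of address strings into one dictionary per row. The names tag_rows, tag_func, 'original', 'address_type' and 'Unparsed' are my own and only for illustration; the only usaddress call used is usaddress.tag, as shown above. The tagger is passed in as a parameter so the function can be tried without usaddress installed. usaddress.tag can raise an error (RepeatedLabelError) when it assigns the same label twice, so the sketch catches exceptions broadly and keeps such rows for manual review.

```python
def tag_rows(addresses, tag_func=None):
    """Tag each address string and return one flat dict per address.

    tag_func is a hypothetical hook: any callable that returns
    (mapping_of_labels, address_type), like usaddress.tag does.
    """
    if tag_func is None:
        # Imported lazily so the function can be tried with a stand-in tagger.
        import usaddress
        tag_func = usaddress.tag

    rows = []
    for addr in addresses:
        try:
            tagged, addr_type = tag_func(addr)
        except Exception:
            # Assumption: treat any tagging failure (for example usaddress's
            # RepeatedLabelError) as "could not parse" and keep the raw text.
            tagged, addr_type = {}, 'Unparsed'
        row = dict(tagged)              # label -> value, e.g. 'ZipCode' -> '60637'
        row['original'] = addr          # keep the source string for checking
        row['address_type'] = addr_type
        rows.append(row)
    return rows
```

Passing the result to pandas.DataFrame(rows) should give one column per label, with NaN wherever an address lacks that part. Rows marked 'Unparsed' still need manual attention, and addresses in the format from the question may not tag cleanly, so it is worth checking a sample before trusting the columns.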
