
Pandas: turn a list of lists of tuples into a DataFrame (awkward column headers)

I have data from parsed addresses, obtained with the usaddress Python library: https://github.com/datamade/usaddress

The data is a list of lists of tuples. Each address has a list like this associated with it:

[('Robie', 'BuildingName'),
('House,', 'BuildingName'),
('5757', 'AddressNumber'),
('South', 'StreetNamePreDirectional'),
('Woodlawn', 'StreetName'),
('Avenue,', 'StreetNamePostType'),
('Chicago,', 'PlaceName'),
('IL', 'StateName'),
('60637', 'ZipCode')]

However, for some addresses a given field may or may not be present. I want to export this data into a pandas DataFrame with all the column headers (BuildingName, Address..., etc.), and if a column header isn't present in an address's list, the cell is just left blank.

What I have at the moment is:

newAddr = []
for index, row in df.iterrows():
    newAddr.append(usaddr.parse(row['FullAddress']))

df2 = DataFrame(newAddr)

But this produces a file with no column headers and no real organization by column, since the missing values just shift everything over.

Help is greatly appreciated.
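The shifting described above happens because each inner list is purely positional: the label travels inside the tuple rather than acting as a key. As a minimal sketch (the helper name pairs_to_record is hypothetical, not part of usaddress), the (token, label) pairs can be collapsed into one dict per address, joining tokens that share a label:

```python
def pairs_to_record(pairs):
    """Collapse (token, label) pairs into {label: "token token ..."}.

    Tokens that repeat a label are joined with a space; punctuation
    such as trailing commas is left untouched.
    """
    record = {}
    for token, label in pairs:
        if label in record:
            record[label] += " " + token
        else:
            record[label] = token
    return record


# The example address from the question.
parsed = [
    ('Robie', 'BuildingName'),
    ('House,', 'BuildingName'),
    ('5757', 'AddressNumber'),
    ('South', 'StreetNamePreDirectional'),
    ('Woodlawn', 'StreetName'),
    ('Avenue,', 'StreetNamePostType'),
    ('Chicago,', 'PlaceName'),
    ('IL', 'StateName'),
    ('60637', 'ZipCode'),
]

record = pairs_to_record(parsed)
```

A list of such dicts can be passed to the DataFrame constructor, which lines values up by key instead of by position.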

Assuming the following:

  • You use usaddress.tag
  • You have ways to handle the errors that may be raised from usaddress.tag
  • You only want the first part of the return value of usaddress.tag

Then you can do the following:

import usaddress
import pandas as pd

# your list of addresses dataframe
df = pd.read_csv('PATH_TO_ADDRESS_CSV')

# list of orderedDict
ordered_dicts = []

# loop through addresses and get respective information
for index, row in df.iterrows():
    # here you should try/except for cases that fail
    addr = usaddress.tag(row['FullAddress'])

    # append to list
    ordered_dicts.append(addr[0])

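# **get all relevant keys in your list
# (sorted into a list so the column order is deterministic;
# newer pandas versions also reject a set passed as `columns`)
cols = sorted(set().union(*(d.keys() for d in ordered_dicts)))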

# create new dataframe
df_new = pd.DataFrame(ordered_dicts, columns=cols)

df_new.to_csv('PATH_TO_DESIRED_CSV_ENDPOINT')
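The try/except mentioned in the loop comment could look roughly like the sketch below. The helper tag_addresses and its parameters are hypothetical; the tagging function is passed in so the pattern can be shown without usaddress installed, and fake_tag is only a stand-in that mimics the (dict, address type) return shape assumed above.

```python
def tag_addresses(addresses, tag, errors=(Exception,)):
    """Tag each address; keep an empty row for any address that fails.

    `tag` is expected to return a (mapping, address_type) pair.
    Returns (rows, failed), where `failed` lists the untaggable inputs.
    """
    rows, failed = [], []
    for address in addresses:
        try:
            tagged, _address_type = tag(address)
        except errors:
            rows.append({})          # keeps row alignment with the input
            failed.append(address)
        else:
            rows.append(dict(tagged))
    return rows, failed


# Stand-in tagger for illustration only (NOT the real usaddress.tag).
def fake_tag(address):
    if not address:
        raise ValueError("cannot tag an empty address")
    return {'PlaceName': address}, 'Street Address'


rows, failed = tag_addresses(['Chicago', ''], fake_tag, errors=(ValueError,))
```

With the real library, the call would presumably be tag_addresses(df['FullAddress'], usaddress.tag, errors=(usaddress.RepeatedLabelError,)). Appending an empty dict for failures keeps the new frame the same length as the original, so the two can still be joined row by row.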

The ** marks a step that has an alternative solution. Because we know exactly which columns the .tag function can return, you can simply set the columns up front (the usaddress documentation lists all the tags and describes the API):

cols = ['AddressNumberPrefix', 'AddressNumber', ...]

I hope this helps! Note that when you build a pd.DataFrame from dictionaries and specify the exact columns, it automatically fills the cells for missing keys with NaN.
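A small illustration of that NaN-filling behavior, using made-up records rather than real usaddress output (the values and the shortened column list are assumptions for the demo):

```python
import pandas as pd

# Two illustrative records with different sets of keys.
records = [
    {'AddressNumber': '5757', 'StreetName': 'Woodlawn', 'ZipCode': '60637'},
    {'StreetName': 'Main', 'PlaceName': 'Chicago'},
]

# Explicit column list: fixes both the set of columns and their order.
cols = ['AddressNumber', 'StreetName', 'PlaceName', 'ZipCode']

df_demo = pd.DataFrame(records, columns=cols)

# Boolean frame marking the cells that were filled with NaN.
missing = df_demo.isna()
```

By default, to_csv writes NaN as an empty field, so the missing cells come out blank in the exported file, which is what the question asks for.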
