Is there a way to expand a column containing lists in a pandas DataFrame and take the column names from the list values themselves?

Question

I have converted a nested JSON file to a pandas DataFrame. Some of the columns now contain lists. They look like this:

0         [BikeParking: True, BusinessAcceptsBitcoin: Fa...
1         [BusinessAcceptsBitcoin: False, BusinessAccept...
2         [Alcohol: none, Ambience: {'romantic': False, ...
3         [AcceptsInsurance: False, BusinessAcceptsCredi...
4         [BusinessAcceptsCreditCards: True, Restaurants...
5         [BusinessAcceptsCreditCards: True, ByAppointme...
6         [BikeParking: True, BusinessAcceptsCreditCards...
7         [Alcohol: none, Ambience: {'romantic': False, ...
8                        [BusinessAcceptsCreditCards: True]
9         [BikeParking: True, BusinessAcceptsCreditCards...
10                                                     None
.
.
.
144070    [Alcohol: none, Ambience: {'romantic': False, ...
144071    [BikeParking: True, BusinessAcceptsCreditCards...
Name: attributes, dtype: object

And this:

0         [Monday 11:0-21:0, Tuesday 11:0-21:0, Wednesda...
1         [Monday 0:0-0:0, Tuesday 0:0-0:0, Wednesday 0:...
2         [Monday 11:0-2:0, Tuesday 11:0-2:0, Wednesday ...
3         [Tuesday 10:0-21:0, Wednesday 10:0-21:0, Thurs...
4                                                      None

144066                                                 None
144067    [Tuesday 8:0-16:0, Wednesday 8:0-16:0, Thursda...
144068    [Tuesday 10:0-17:30, Wednesday 10:0-17:30, Thu...
144069                                                 None
144070    [Monday 11:0-20:0, Tuesday 11:0-20:0, Wednesda...
144071    [Monday 10:0-21:0, Tuesday 10:0-21:0, Wednesda...
Name: hours, dtype: object

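Is there a way to automatically extract the labels (BikeParking, AcceptsInsurance, etc.) and use them as column names, filling the cells with the True/False values? For the Ambience dict, I would like something like an Ambience_romantic column with True/False in the cells. Similarly, I want to extract the days of the week as column names and fill the cells with the hours.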

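Or is there a way to flatten the JSON data beforehand? I tried passing the JSON data line by line to json_normalize and creating a DataFrame from the output, but that produces the same result. Maybe I am doing something wrong?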

Original JSON format (yelp_academic_dataset_business.json):

 {
    "business_id":"encrypted business id",
    "name":"business name",
    "neighborhood":"hood name",
    "address":"full address",
    "city":"city",
    "state":"state -- if applicable --",
    "postal code":"postal code",
    "latitude":latitude,
    "longitude":longitude,
    "stars":star rating, rounded to half-stars,
    "review_count":number of reviews,
    "is_open":0/1 (closed/open),
    "attributes":["an array of strings: each array element is an attribute"],
    "categories":["an array of strings of business categories"],
    "hours":["an array of strings of business hours"],
    "type": "business"
}

My attempt with json_normalize:

import json
from pandas import json_normalize

with open('yelp_academic_dataset_business.json') as f:
    # Normalize the JSON data to flatten it and store the output in a DataFrame
    frame = json_normalize([json.loads(line) for line in f])

# Write the DataFrame to a CSV file
frame.to_csv('yelp_academic_dataset_business.csv', encoding='utf-8', index=False)

What I am currently trying:

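from io import StringIO
from pandas import read_json

with open(json_filename) as f:
    # Remove the trailing "\n" from each line
    data = [line.rstrip() for line in f]

# Join the one-document-per-line records into a single JSON array
data_json_str = "[" + ",".join(data) + "]"

# Wrap in StringIO: recent pandas versions deprecate passing a raw JSON string
df = read_json(StringIO(data_json_str))
# Now looking to expand df['attributes'] and others here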

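I should also mention that my goal is to convert this to a .csv so that I can load it into a database. I do not want lists in my database columns.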

You can get the original JSON data from the Yelp Dataset Challenge site: https://www.yelp.ca/dataset_challenge/dataset

Answer 1

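You are trying to convert "documents" (semi-structured data) into a table. That can be problematic if one record contains, say, 100 attributes that no other record has: you probably do not want to add 100 columns to the main table and leave those cells empty for every other record.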
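To make that concern concrete, here is a minimal sketch of the expansion asked about in the question. It assumes every attribute string has the form `Key: value` as in the output shown above; the helper name `parse_attributes` and the sample rows are made up for illustration and are not part of pandas or the Yelp dataset.

```python
import ast
import pandas as pd

def parse_attributes(items):
    """Turn ['BikeParking: True', "Ambience: {'romantic': False}"] into a flat dict."""
    flat = {}
    if not isinstance(items, list):      # None / NaN: the record has no attributes
        return flat
    for item in items:
        key, _, raw = item.partition(": ")
        try:
            value = ast.literal_eval(raw)    # 'True' -> True, "{'a': 1}" -> dict
        except (ValueError, SyntaxError):
            value = raw                      # plain words such as 'none' stay as text
        if isinstance(value, dict):
            # Nested dicts become prefixed columns, e.g. Ambience_romantic
            for sub_key, sub_value in value.items():
                flat[f"{key}_{sub_key}"] = sub_value
        else:
            flat[key] = value
    return flat

# Made-up sample rows in the same shape as the question's `attributes` column
df = pd.DataFrame({
    "business_id": ["a", "b", "c"],
    "attributes": [
        ["BikeParking: True", "BusinessAcceptsBitcoin: False"],
        ["Alcohol: none", "Ambience: {'romantic': False, 'casual': True}"],
        None,
    ],
})

# One dict per row; pandas uses the union of all keys as the column set
expanded = pd.DataFrame([parse_attributes(x) for x in df["attributes"]], index=df.index)
wide = df.drop(columns="attributes").join(expanded)
print(wide)
```

Every attribute that appears in only one record still becomes a column for all records, filled with NaN everywhere else, which is exactly the sparsity problem described above. The `hours` strings ("Monday 11:0-21:0") could presumably be handled the same way by splitting on the first space instead of on ": ".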

But in the end, you have explained that you intend to do this:

Load the JSON.
Convert to pandas.
Export CSV.
Import into the database.

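I am here to tell you that this is totally pointless. Shuffling the data through all of these intermediate formats will only cause problems.

Instead, let's go back to basics:

Load the JSON.
Write to the database.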

Now the first step is to come up with a schema. Or, if you are using a NoSQL database, you can load the JSON directly, with no further steps.
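As a rough illustration of the two-step route, the sketch below uses only the standard library (`json` and `sqlite3`). The schema is an assumption chosen for the example, not something prescribed by the dataset: one `business` table for the scalar fields, plus a narrow `business_attribute` table with one row per (business, attribute) pair, so rare attributes cost rows rather than columns. The function name `load_businesses` and the sample lines are hypothetical.

```python
import json
import sqlite3

# Hypothetical schema: one row per business, one row per (business, attribute)
SCHEMA = """
CREATE TABLE IF NOT EXISTS business (
    business_id TEXT PRIMARY KEY,
    name        TEXT,
    city        TEXT,
    stars       REAL
);
CREATE TABLE IF NOT EXISTS business_attribute (
    business_id TEXT REFERENCES business(business_id),
    key         TEXT,
    value       TEXT
);
"""

def load_businesses(lines, conn):
    """Insert one JSON document per line into the two tables."""
    conn.executescript(SCHEMA)
    for line in lines:
        if not line.strip():
            continue                         # skip blank lines
        record = json.loads(line)
        conn.execute(
            "INSERT INTO business VALUES (?, ?, ?, ?)",
            (record["business_id"], record.get("name"),
             record.get("city"), record.get("stars")),
        )
        # `attributes` may be null in the JSON, hence the `or []`
        for item in record.get("attributes") or []:
            key, _, value = item.partition(": ")
            conn.execute(
                "INSERT INTO business_attribute VALUES (?, ?, ?)",
                (record["business_id"], key, value),
            )
    conn.commit()

# Made-up sample lines; with the real file, pass the open file object instead
sample = [
    '{"business_id": "a", "name": "Cafe", "city": "Toronto", "stars": 4.5, '
    '"attributes": ["BikeParking: True", "Alcohol: none"]}',
    '{"business_id": "b", "name": "Garage", "city": "Montreal", "stars": 3.0, '
    '"attributes": null}',
]
conn = sqlite3.connect(":memory:")
load_businesses(sample, conn)
rows = conn.execute(
    "SELECT business_id, key, value FROM business_attribute ORDER BY key"
).fetchall()
print(rows)
```

With the real data this would be called as `load_businesses(f, conn)` inside `with open('yelp_academic_dataset_business.json') as f:`. The values are stored as text here for simplicity; a real schema might well type them or split nested dicts such as Ambience into their own rows.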
