Is there any way to expand a column in a pandas DataFrame containing lists and fetch the column names from the list values themselves?
I've converted a nested JSON file to a pandas DataFrame. Some of the columns now contain lists. They look like this:
0 [BikeParking: True, BusinessAcceptsBitcoin: Fa...
1 [BusinessAcceptsBitcoin: False, BusinessAccept...
2 [Alcohol: none, Ambience: {'romantic': False, ...
3 [AcceptsInsurance: False, BusinessAcceptsCredi...
4 [BusinessAcceptsCreditCards: True, Restaurants...
5 [BusinessAcceptsCreditCards: True, ByAppointme...
6 [BikeParking: True, BusinessAcceptsCreditCards...
7 [Alcohol: none, Ambience: {'romantic': False, ...
8 [BusinessAcceptsCreditCards: True]
9 [BikeParking: True, BusinessAcceptsCreditCards...
10 None
.
.
.
144070 [Alcohol: none, Ambience: {'romantic': False, ...
144071 [BikeParking: True, BusinessAcceptsCreditCards...
Name: attributes, dtype: object
and this:
0 [Monday 11:0-21:0, Tuesday 11:0-21:0, Wednesda...
1 [Monday 0:0-0:0, Tuesday 0:0-0:0, Wednesday 0:...
2 [Monday 11:0-2:0, Tuesday 11:0-2:0, Wednesday ...
3 [Tuesday 10:0-21:0, Wednesday 10:0-21:0, Thurs...
4 None
144066 None
144067 [Tuesday 8:0-16:0, Wednesday 8:0-16:0, Thursda...
144068 [Tuesday 10:0-17:30, Wednesday 10:0-17:30, Thu...
144069 None
144070 [Monday 11:0-20:0, Tuesday 11:0-20:0, Wednesda...
144071 [Monday 10:0-21:0, Tuesday 10:0-21:0, Wednesda...
Name: hours, dtype: object
Is there any way for me to automatically extract the tags (BikeParking, AcceptsInsurance, etc.) and use them as column names while filling the cells with the true/false values? For the Ambience dict, I want to do something like Ambience_romantic with true/false in the cells. Similarly, I want to extract the days of the week as column names and use the hours to fill the cells.
Or is there a way to flatten the JSON data beforehand? I have tried passing the JSON data line by line to json_normalize and creating a DataFrame from the output, but it produces the same result. Maybe I'm doing something wrong?
Format of the original JSON (yelp_academic_dataset_business.json):
{
"business_id":"encrypted business id",
"name":"business name",
"neighborhood":"hood name",
"address":"full address",
"city":"city",
"state":"state -- if applicable --",
"postal code":"postal code",
"latitude":latitude,
"longitude":longitude,
"stars":star rating, rounded to half-stars,
"review_count":number of reviews,
"is_open":0/1 (closed/open),
"attributes":["an array of strings: each array element is an attribute"],
"categories":["an array of strings of business categories"],
"hours":["an array of strings of business hours"],
"type": "business"
}
My initial attempt with json_normalize:
import json
from pandas import json_normalize  # available at the top level in pandas >= 1.0

with open('yelp_academic_dataset_business.json') as f:
    # Normalize the JSON data to flatten it and store the output in a DataFrame
    frame = json_normalize([json.loads(line) for line in f])

# Write the DataFrame to a CSV file
frame.to_csv('yelp_academic_dataset_business.csv', encoding='utf-8', index=False)
What I'm currently trying:
from pandas import read_json

with open(json_filename) as f:
    data = f.readlines()

# Remove the trailing "\n" from each line
data = [line.rstrip() for line in data]
# Join the one-object-per-line records into a single JSON array
data_json_str = "[" + ','.join(data) + "]"
df = read_json(data_json_str)
# Now looking to expand df['attributes'] and others here
I should also mention that my aim is to convert it to .csv to load it into a database. I don't want lists in my database columns.
You can get the original JSON data from the Yelp Dataset Challenge site: https://www.yelp.ca/dataset_challenge/dataset
You're trying to convert "documents" (semi-structured data) into a table. This could be problematic if one record contains, e.g., 100 attributes that no other record has: you probably don't want to add 100 columns to a master table and have empty cells for all other records.
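One way to avoid a wide, mostly empty table is to keep one row per (business, attribute) pair instead of one column per attribute. The sketch below is only an illustration under assumptions: it assumes each attribute string has the form "Name: value" as in the output shown in the question, and the helper names parse_attribute and to_long_rows are hypothetical, not part of pandas or the Yelp dataset tooling. Nested dicts such as Ambience are flattened into names like Ambience_romantic.

```python
import ast

def parse_attribute(entry):
    """Split one 'Name: value' string into a list of (name, value) pairs."""
    name, _, raw = entry.partition(": ")
    try:
        value = ast.literal_eval(raw)   # handles True, False, numbers, dicts
    except (ValueError, SyntaxError):
        value = raw                     # plain words such as 'none' stay strings
    if isinstance(value, dict):
        # A nested dict becomes one pair per key, e.g. Ambience_romantic
        return [(f"{name}_{key}", val) for key, val in value.items()]
    return [(name, value)]

def to_long_rows(business_id, attributes):
    """Yield (business_id, attribute, value) rows; None means no attributes."""
    for entry in attributes or []:
        for name, value in parse_attribute(entry):
            yield (business_id, name, value)

# Hypothetical sample record, shaped like the 'attributes' column above
sample = [
    "BikeParking: True",
    "Alcohol: none",
    "Ambience: {'romantic': False, 'casual': True}",
]
rows = list(to_long_rows("b1", sample))
```

Each resulting row can be written straight to a CSV file or a child table, and a record with 100 unusual attributes simply contributes 100 rows rather than 100 new columns.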
But in the end you have explained that you intend to do this:
And I am here to tell you that this is all entirely pointless. Mashing the data through all these intermediate formats will only cause problems.
Instead, let's get back to basics:
Now the first step is coming up with a schema. Or, if you're using a NoSQL database, you can directly load the JSON with no other steps required.
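As a rough illustration of what "coming up with a schema" could look like, here is a minimal sketch that reads the JSON lines directly and writes them into a relational database, skipping pandas and CSV altogether. Everything specific in it is an assumption made for the example: SQLite as the database, the table and column names, the load_businesses helper, and the two sample records. The list-valued fields go into child tables keyed by business_id, so no column ever holds a list.

```python
import json
import sqlite3

# Assumed schema: one parent table plus one child table per list-valued field
SCHEMA = """
CREATE TABLE business (business_id TEXT PRIMARY KEY, name TEXT, city TEXT, stars REAL);
CREATE TABLE business_attribute (business_id TEXT, entry TEXT);
CREATE TABLE business_hours (business_id TEXT, day TEXT, opens TEXT, closes TEXT);
"""

def load_businesses(conn, lines):
    """Insert one business per JSON line, with list fields in child tables."""
    for line in lines:
        rec = json.loads(line)
        bid = rec["business_id"]
        conn.execute("INSERT INTO business VALUES (?, ?, ?, ?)",
                     (bid, rec.get("name"), rec.get("city"), rec.get("stars")))
        conn.executemany("INSERT INTO business_attribute VALUES (?, ?)",
                         [(bid, entry) for entry in rec.get("attributes") or []])
        for entry in rec.get("hours") or []:
            day, _, span = entry.partition(" ")     # e.g. 'Monday 11:0-21:0'
            opens, _, closes = span.partition("-")
            conn.execute("INSERT INTO business_hours VALUES (?, ?, ?, ?)",
                         (bid, day, opens, closes))
    conn.commit()

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)

# Hypothetical records in the one-JSON-object-per-line layout of the dataset
sample = [
    '{"business_id": "b1", "name": "Cafe", "city": "Toronto", "stars": 4.5, '
    '"attributes": ["BikeParking: True"], "hours": ["Monday 11:0-21:0"]}',
    '{"business_id": "b2", "name": "Shop", "city": "Toronto", "stars": 3.0, '
    '"attributes": null, "hours": null}',
]
load_businesses(conn, sample)
hours = conn.execute("SELECT * FROM business_hours").fetchall()
```

For the real file, the same function could be called with the open file object in place of sample, since iterating over a file yields one line at a time.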