How to create pandas.DataFrame from list of list of JSON
I have a pandas DataFrame loaded from a CSV (gist with a small sample):
| title | genres |
--------------------------------------------------------
| %title1% |[{id: 1, name: '...'}, {id: 2, name: '...'}]|
| %title2% |[{id: 2, name: '...'}, {id: 4, name: '...'}]|
...
| %title9% |[{id: 3, name: '...'}, {id: 9, name: '...'}]|
Each title can be associated with a varying number of genres (one or more).
The task is to convert the arrays in the genres column into columns and put ones (or Trues) for each genre:
| title | genre_1 | genre_2 | genre_3 | ... | genre_9 |
---------------------------------------------------------
| %title1% | 1 | 1 | 0 | ... | 0 |
| %title2% | 0 | 1 | 0 | ... | 0 |
...
| %title9% | 0 | 0 | 1 | ... | 1 |
The genres form a fixed set (about 20 items).
The naive method is to go through the rows, look at each genre in the genres column, and fill the column for that genre with 1. This approach looks a bit weird.
I think pandas has a more suitable method for that.
As far as I know, there is no way to perform JSON deserialization on a pandas DataFrame in a vectorized fashion. One way you ought to be able to do this is with .iterrows(), which will let you do this in one loop (albeit slower than most built-in pandas operations).
import json

# df is your DataFrame; its 'genres' column holds JSON strings

for index, row in df.iterrows():
    # deserialize the JSON string
    json_data = json.loads(row['genres'])
    # add a new column for each of the genres (pandas is okay with it being sparse)
    for genre in json_data:
        df.loc[index, genre['name']] = 1  # update the row in the df itself

df.drop(['genres'], axis=1, inplace=True)
Note that empty cells will be filled with NaN, not 0; you should use .fillna() to change this. A brief example with a vaguely similar DataFrame looks like:
In [1]: import pandas as pd

In [2]: df = pd.DataFrame([{'title': 'hello', 'json': '{"foo": "bar"}'},
   ...:                    {'title': 'world', 'json': '{"foo": "bar", "baz": "boo"}'}])

In [3]: df.head()
Out[3]:
                           json  title
0                {"foo": "bar"}  hello
1  {"foo": "bar", "baz": "boo"}  world

In [4]: import json
   ...: for index, row in df.iterrows():
   ...:     data = json.loads(row['json'])
   ...:     for k, v in data.items():
   ...:         df.loc[index, k] = v
   ...: df.drop(['json'], axis=1, inplace=True)

In [5]: df.head()
Out[5]:
   title  foo  baz
0  hello  bar  NaN
1  world  bar  boo
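The NaN cells mentioned in the note can be turned into 0 with .fillna(). The following is a minimal sketch, assuming a small hypothetical frame whose genre columns (Action, Drama) are invented for illustration; it also casts the result to integers, which is an optional extra step.

```python
import pandas as pd

# Hypothetical result of the iterrows() loop: 1 where a genre applies, NaN elsewhere
df = pd.DataFrame({
    'title': ['title1', 'title2'],
    'Action': [1.0, float('nan')],
    'Drama': [1.0, 1.0],
})

# Every column except 'title' is treated as a genre flag
genre_cols = df.columns.drop('title')

# Replace NaN with 0, then cast the flags to integers
df[genre_cols] = df[genre_cols].fillna(0).astype(int)
print(df)
```

Restricting fillna() to the genre columns avoids writing 0 into a missing title by accident.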
If your CSV data looks like this:
(I added quotes to the keys of the genres JSON just so it works easily with the json package. Since that is not the main problem, you can do it as preprocessing.)
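One possible way to do that key-quoting preprocessing is a regular expression. This is only a sketch: quote_keys is a hypothetical helper, the sample string is invented, and it assumes genre names contain no quotes, commas followed by a word and a colon, or other characters that would confuse the pattern.

```python
import json
import re


def quote_keys(raw):
    # Hypothetical helper: wrap bare keys that follow '{' or ',' in double quotes
    quoted = re.sub(r'([{,]\s*)([A-Za-z_]\w*)\s*:', r'\1"\2":', raw)
    # JSON strings need double quotes; assumes values contain no quote characters
    return quoted.replace("'", '"')


# Invented sample in the unquoted style shown in the question
raw = "[{id: 1, name: 'Action'}, {id: 2, name: 'Drama'}]"
genres = json.loads(quote_keys(raw))
print(genres)
```

If the real file already contains valid JSON, this step is unnecessary and json.loads can be applied directly.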
You will have to iterate through all the rows of the input DataFrame:
for index, row in inputDf.iterrows():
    fullDataFrame = pd.concat([fullDataFrame, get_dataframe_for_a_row(row)])
The get_dataframe_for_a_row function builds a one-row DataFrame for each row, and the loop concatenates them into the full DataFrame. pd.concat() concatenates the DataFrames obtained from each row and reuses the columns that already exist.
Finally, call fullDataFrame.fillna(0) to replace NaN with 0.
Your final DataFrame will then have the title column plus one genre_<id> column per genre.
Here is the full code:
import json

import pandas as pd

inputDf = pd.read_csv('title_genre.csv')


def labels_for_genre(genres):
    # Build one column label per genre, e.g. 'genre_1'
    return ['genre_' + str(genre['id']) for genre in genres]


def get_dataframe_for_a_row(row):
    # Build a one-row DataFrame: the title plus a 1 in each of its genre columns
    labels = labels_for_genre(json.loads(row['genres']))
    tempDf = pd.DataFrame()
    tempDf['title'] = [row['title']]
    for label in labels:
        tempDf[label] = [1]
    return tempDf


fullDataFrame = pd.DataFrame()
for index, row in inputDf.iterrows():
    # pd.concat aligns on column names, so existing genre columns are reused
    fullDataFrame = pd.concat([fullDataFrame, get_dataframe_for_a_row(row)])

# genres a title does not have are NaN after the concat; make them 0
fullDataFrame = fullDataFrame.fillna(0)
Full working solution without iterrows:
import itertools
import json

import pandas as pd

# read data
movies_df = pd.read_csv(
    'https://gist.githubusercontent.com/feeeper/9c7b1e8f8a4cc262f17675ef0f6e1124/raw/022c0d45c660970ca55e889cd763ce37a54cc73b/example.csv',
    converters={'genres': json.loads},
)

# get genres for all items
all_genres_entries = list(itertools.chain.from_iterable(movies_df['genres'].values))

# create the list with unique genres
genres = list({v['id']: v for v in all_genres_entries}.values())

# fill genres columns
for genre in genres:
    movies_df['genre_{}'.format(genre['id'])] = movies_df['genres'].apply(lambda x: 1 if genre in x else 0)
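To try this approach without downloading the gist, here is a self-contained sketch on a hypothetical two-row frame. The titles and genre names are invented, and unlike the code above it matches genres by id only rather than comparing whole dictionaries.

```python
import itertools
import json

import pandas as pd

# Hypothetical stand-in for the CSV: the genres column holds JSON strings
movies_df = pd.DataFrame({
    'title': ['title1', 'title2'],
    'genres': [
        '[{"id": 1, "name": "Action"}, {"id": 2, "name": "Drama"}]',
        '[{"id": 2, "name": "Drama"}, {"id": 4, "name": "Comedy"}]',
    ],
})
movies_df['genres'] = movies_df['genres'].apply(json.loads)

# Collect the distinct genre ids across all rows
all_ids = sorted({g['id'] for g in itertools.chain.from_iterable(movies_df['genres'])})

# One 0/1 column per genre id
for genre_id in all_ids:
    movies_df['genre_{}'.format(genre_id)] = movies_df['genres'].apply(
        lambda gs, gid=genre_id: int(any(g['id'] == gid for g in gs))
    )

print(movies_df.drop(columns='genres'))
```

Binding genre_id as a default argument (gid=genre_id) keeps each lambda tied to its own id; it is not strictly needed here because apply runs immediately, but it guards against late-binding surprises if the lambdas are ever stored.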