How to create a pandas.DataFrame from a list of lists of JSON

I have a pandas DataFrame loaded from a CSV (gist with a small sample):

|  title   |                       genres               |
--------------------------------------------------------
| %title1% |[{id: 1, name: '...'}, {id: 2, name: '...'}]|
| %title2% |[{id: 2, name: '...'}, {id: 4, name: '...'}]|
...
| %title9% |[{id: 3, name: '...'}, {id: 9, name: '...'}]|

Each title can be associated with a varying number of genres (one or more).

The task is to convert the arrays in the genres column into columns and put ones (or Trues) for each genre:

|  title   | genre_1 | genre_2 | genre_3 | ... | genre_9 |
---------------------------------------------------------
| %title1% |    1    |    1    |    0    | ... |    0    |
| %title2% |    0    |    1    |    0    | ... |    0    |
...
| %title9% |    0    |    0    |    1    | ... |    1    |

The genres form a fixed set (about 20 items).

The naive method is:

  1. Create the set of all genres.
  2. Create a column for each genre, filled with 0.
  3. For each row in the DataFrame, check which genres appear in the genres column and fill the columns for those genres with 1.
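
The three steps above could be sketched roughly as follows. This is only an illustration: the sample rows are made up, and it assumes the genres cells are valid JSON strings (with quoted keys), which the sample table does not guarantee.

```python
import json

import pandas as pd

# Hypothetical sample data; valid JSON strings with quoted keys are an assumption.
df = pd.DataFrame({
    'title': ['title1', 'title2'],
    'genres': [
        '[{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]',
        '[{"id": 2, "name": "b"}, {"id": 4, "name": "d"}]',
    ],
})

# Parse each cell once into the set of genre ids it contains.
ids_per_row = df['genres'].apply(lambda s: {g['id'] for g in json.loads(s)})

# Step 1: the set of all genres.
all_ids = sorted(set().union(*ids_per_row))

# Step 2: one column per genre, filled with 0.
for genre_id in all_ids:
    df['genre_{}'.format(genre_id)] = 0

# Step 3: for each row, put 1 in the columns of the genres it has.
for index, ids in ids_per_row.items():
    for genre_id in ids:
        df.loc[index, 'genre_{}'.format(genre_id)] = 1
```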

This approach looks a bit weird.

I think pandas has a more suitable method for this.

As far as I know, there is no way to perform JSON deserialization on a pandas DataFrame in a vectorized fashion. One way you ought to be able to do this is with .iterrows(), which lets you do it in one loop (albeit slower than most built-in pandas operations).

import json

# df is your DataFrame; its 'genres' column holds JSON strings

for index, row in df.iterrows():
    # deserialize the JSON string
    json_data = json.loads(row['genres'])

    # add a new column for each of the genres (Pandas is okay with it being sparse)
    for genre in json_data:
        df.loc[index, genre['name']] = 1  # update the row in the df itself

df.drop(['genres'], axis=1, inplace=True)

Note that empty cells will be filled with NaN, not 0; you should use .fillna() to change this. A brief example with a vaguely similar DataFrame looks like this:

In [1]: import pandas as pd

In [2]: df = pd.DataFrame([{'title': 'hello', 'json': '{"foo": "bar"}'},
   ...:                    {'title': 'world', 'json': '{"foo": "bar", "baz": "boo"}'}])

In [3]: df.head()
Out[3]:
                           json  title
0                {"foo": "bar"}  hello
1  {"foo": "bar", "baz": "boo"}  world

In [4]: import json
   ...: for index, row in df.iterrows():
   ...:     data = json.loads(row['json'])
   ...:     for k, v in data.items():
   ...:         df.loc[index, k] = v
   ...: df.drop(['json'], axis=1, inplace=True)

In [5]: df.head()
Out[5]:
   title  foo  baz
0  hello  bar  NaN
1  world  bar  boo
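
For the genres layout from the question, the same loop plus .fillna(0) might look like the sketch below. The sample rows and genre names are invented for illustration, and the cells are assumed to be valid JSON strings.

```python
import json

import pandas as pd

# Hypothetical sample rows; the genre names are made up.
df = pd.DataFrame({
    'title': ['title1', 'title2'],
    'genres': [
        '[{"id": 1, "name": "Drama"}, {"id": 2, "name": "Comedy"}]',
        '[{"id": 2, "name": "Comedy"}]',
    ],
})

for index, row in df.iterrows():
    # each cell is a JSON array of genre objects
    for genre in json.loads(row['genres']):
        df.loc[index, genre['name']] = 1  # creates the column on first use

df = df.drop(columns=['genres'])

# cells never assigned are NaN; replace them with 0 and store plain integers
genre_columns = df.columns.drop('title')
df[genre_columns] = df[genre_columns].fillna(0).astype(int)
```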

If your CSV data looks like this:

(I added quotes to the keys of the genres JSON just so it works easily with the json package. Since that is not the main problem, you can do it as preprocessing.)

[image: sample CSV data]

You will have to iterate through all the rows of the input DataFrame.

for index, row in inputDf.iterrows():
    fullDataFrame = pd.concat([fullDataFrame, get_dataframe_for_a_row(row)])

In the get_dataframe_for_a_row function:

  • prepare a DataFrame with a title column holding the value row['title']
  • add columns whose names are formed by appending the genre id to 'genre_'
  • assign them the value 1

Then build a DataFrame for each row and concatenate them into the full DataFrame. pd.concat() concatenates the DataFrames obtained from each row, aligning on columns that already exist.

Finally, call fullDataFrame.fillna(0) to replace NaN with 0.
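
A small sketch of how pd.concat() lines up the per-row frames and where the NaN values come from. The two one-row frames are hypothetical stand-ins for what get_dataframe_for_a_row returns, and ignore_index=True is used here only to get a clean 0..n-1 index.

```python
import pandas as pd

# Hypothetical one-row frames, one per title.
row1 = pd.DataFrame({'title': ['title1'], 'genre_1': [1], 'genre_2': [1]})
row2 = pd.DataFrame({'title': ['title2'], 'genre_2': [1], 'genre_4': [1]})

# Columns are matched by name; a genre missing from a row becomes NaN.
combined = pd.concat([row1, row2], ignore_index=True)

# Replace the NaN placeholders with 0.
filled = combined.fillna(0)
```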

Your final DataFrame will look like this:

[image: resulting DataFrame]

Here is the full code:

import pandas as pd
import json

inputDf = pd.read_csv('title_genre.csv')

def labels_for_genre(a):
    # build one 'genre_<id>' column label per genre entry
    labels = []
    for i in range(0, len(a)):
        label = 'genre' + '_' + str(a[i]['id'])
        labels.append(label)
    return labels

def get_dataframe_for_a_row(row):
    # one-row DataFrame: the title plus a 1 in each of its genre columns
    labels = labels_for_genre(json.loads(row['genres']))
    tempDf = pd.DataFrame()
    tempDf['title'] = [row['title']]
    for label in labels:
        tempDf[label] = [1]  # integer 1, not the string '1'
    return tempDf

fullDataFrame = pd.DataFrame()
for index, row in inputDf.iterrows():
    # concat aligns on column names; genres missing from a row become NaN
    fullDataFrame = pd.concat([fullDataFrame, get_dataframe_for_a_row(row)])
fullDataFrame = fullDataFrame.fillna(0)

Full working solution without iterrows:

import pandas as pd
import itertools
import json

# read data
movies_df = pd.read_csv('https://gist.githubusercontent.com/feeeper/9c7b1e8f8a4cc262f17675ef0f6e1124/raw/022c0d45c660970ca55e889cd763ce37a54cc73b/example.csv', converters={ 'genres': json.loads })

# get genres for all items
all_genres_entries = list(itertools.chain.from_iterable(movies_df['genres'].values))

# create the list with unique genres
genres = list({v['id']:v for v in all_genres_entries}.values())

# fill genres columns
for genre in genres:
    movies_df['genre_{}'.format(genre['id'])] = movies_df['genres'].apply(lambda x: 1 if genre in x else 0)
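
Note that genre in x compares whole dictionaries, so an entry only matches when both its id and its name are identical to the stored one. If matching on id alone is preferred, a variant along these lines might work; it is a sketch on made-up, already-parsed sample data rather than the gist, and ids_per_row and all_ids are names introduced here for illustration.

```python
import itertools

import pandas as pd

# Hypothetical, already-parsed sample data (note the differing spellings of id 2).
movies_df = pd.DataFrame({
    'title': ['title1', 'title2'],
    'genres': [
        [{'id': 1, 'name': 'Drama'}, {'id': 2, 'name': 'Comedy'}],
        [{'id': 2, 'name': 'comedy'}, {'id': 4, 'name': 'Action'}],
    ],
})

# the set of genre ids present in each row
ids_per_row = movies_df['genres'].apply(lambda entries: {g['id'] for g in entries})

# the sorted list of unique genre ids across all rows
all_ids = sorted(set(itertools.chain.from_iterable(ids_per_row)))

# fill genre columns by id membership only
for genre_id in all_ids:
    movies_df['genre_{}'.format(genre_id)] = ids_per_row.apply(
        lambda ids, gid=genre_id: int(gid in ids)
    )
```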
