简体   繁体   中英

Flatten pandas data frame with a JSON column

I have a very large dataset in CSV format in which one column is a JSON string. I want to read this information into a flat Pandas data frame. How can I achieve this efficiently?

Input CSV:

col1,col2,col3,col4
1,Programming,"{""col3_1"":null,""col3_2"":""Java""}",11
2,Sport,"{""col3_1"":null,""col3_2"":""Soccer""}",22
3,Food,"{""col3_1"":null,""col3_2"":""Pizza""}",33 

Expected DataFrame:

+---------------------------------------------------------------+
|   col1    |    col2     |   col3_1    |   col3_2  |   col4    |
+---------------------------------------------------------------+
|    1      | Programming |    None     |    Java   |    11     |
|    2      |    Sport    |    None     |   Soccer  |    22     |
|    3      |    Food     |    None     |   Pizza   |    33     |
+---------------------------------------------------------------+

I can currently get the expected output using the following code. I just want to know if there is a more efficient way to achieve the same.

import json
import pandas
dataset = pandas.read_csv('/dataset.csv')
dataset['col3'] = dataset['col3'].apply(json.loads)
dataset['col3_1'] = dataset['col3'].apply(lambda row: row['col3_1'])
dataset['col3_2'] = dataset['col3'].apply(lambda row: row['col3_2'])
dataset = dataset.drop(columns=['col3'])

you can parse JSON in Pandas column using json.loads() and convert it to Pandas columns using pd.Series() :

In [85]: df.join(df.pop('col3').apply(lambda x: pd.Series(json.loads(x))))
Out[85]:
   col1         col2  col4 col3_1  col3_2
0     1  Programming    11   None    Java
1     2        Sport    22   None  Soccer
2     3         Food    33   None   Pizza

Use for better performance DataFrame contructor with DataFrame.pop for extract column:

df1 = pd.DataFrame(df.pop('col3').apply(pd.io.json.loads).values.tolist(), index=df.index)
df = df.join(df1)
print (df)
   col1         col2  col4 col3_1  col3_2
0     1  Programming    11   None    Java
1     2        Sport    22   None  Soccer
2     3         Food    33   None   Pizza

Detail :

print (df.pop('col3').apply(pd.io.json.loads))
0      {'col3_1': None, 'col3_2': 'Java'}
1    {'col3_1': None, 'col3_2': 'Soccer'}
2     {'col3_1': None, 'col3_2': 'Pizza'}
Name: col3, dtype: object

print (pd.DataFrame(df.pop('col3').apply(pd.io.json.loads).values.tolist(), index=df.index))
  col3_1  col3_2
0   None    Java
1   None  Soccer
2   None   Pizza

Solutions are similar, but performance is different:

df = pd.concat([df] * 10000, ignore_index=True)

In [204]: %timeit df.join(pd.DataFrame(df['col3'].apply(pd.io.json.loads).values.tolist(), index=df.index))
10 loops, best of 3: 76.4 ms per loop

In [205]: %timeit df.join(df['col3'].apply(lambda x: pd.Series(json.loads(x))))
1 loop, best of 3: 11.3 s per loop

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM