Consider the following DataFrame:
import pandas as pd
df = pd.DataFrame({'id': [1, 2, 3],
'json_col': [ [{'aa' : 1, 'ab' : 1}, {'aa' : 3, 'ab' : 2, 'ac': 6}],
[{'aa' : 1, 'ab' : 2, 'ac': 1}, {'aa' : 5}],
[{'aa': 3, 'ac': 2}] ]})
df
Out[134]:
id json_col
0 1 [{'aa': 1, 'ab': 1}, {'aa': 3, 'ab': 2, 'ac': 6}]
1 2 [{'aa': 1, 'ab': 2, 'ac': 1}, {'aa': 5}]
2 3 [{'aa': 3, 'ac': 2}]
Each id has a list of JSON objects (dicts).
For each 'id', and for each JSON object in its list, I'd like one row in the DataFrame, so that the resulting DataFrame looks like this:
id aa ab ac
0 1 1 1.0 NaN
1 1 3 2.0 6.0
2 2 1 2.0 1.0
3 2 5 NaN NaN
4 3 3 NaN 2.0
Here, id 1 had 2 JSON objects in its list and therefore gets 2 rows in the new DataFrame.
Is there a pythonic way to do this using pandas, numpy or json functionality?
setup = """
import pandas as pd
df = pd.DataFrame({'id': [1, 2, 3],
'json_col': [ [{'aa' : 1, 'ab' : 1}, {'aa' : 3, 'ab' : 2, 'ac': 6}],
[{'aa' : 1, 'ab' : 2, 'ac': 1}, {'aa' : 5}],
[{'aa': 3, 'ac': 2}] ]})
"""
s1 = """
df = pd.concat(
[pd.DataFrame(j, index=[i]*len(j)) for i, j in enumerate(df['json_col'], 1)],
sort=False
)
"""
s2 = """
recs = df.apply(lambda x: [{**{'id': x.id}, **d} for d in x.json_col], axis=1).sum()
df2 = pd.DataFrame.from_records(recs)
"""
%timeit(s1, setup)
52.3 ns ± 2.6 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
%timeit(s2, setup)
50.6 ns ± 3.28 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
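The nanosecond figures above most likely mean the statement strings were never executed: `%timeit(s1, setup)` times the expression `(s1, setup)`, that is, building a tuple of two strings. A minimal sketch of timing the two snippets with the standard `timeit` module follows. It assumes the results are bound to new names (`out1`, `df2`) so that `df` is not overwritten between loops, and the repetition count `n` is an arbitrary choice. No timings are quoted here because they depend on the machine and pandas version.

```python
import timeit

setup = """
import pandas as pd
df = pd.DataFrame({'id': [1, 2, 3],
                   'json_col': [[{'aa': 1, 'ab': 1}, {'aa': 3, 'ab': 2, 'ac': 6}],
                                [{'aa': 1, 'ab': 2, 'ac': 1}, {'aa': 5}],
                                [{'aa': 3, 'ac': 2}]]})
"""

# Bind the result to out1 (not df) so every loop starts from the same input.
s1 = """
out1 = pd.concat(
    [pd.DataFrame(j, index=[i] * len(j)) for i, j in enumerate(df['json_col'], 1)],
    sort=False,
)
"""

s2 = """
recs = df.apply(lambda x: [{**{'id': x.id}, **d} for d in x.json_col], axis=1).sum()
df2 = pd.DataFrame.from_records(recs)
"""

n = 50  # arbitrary repetition count, chosen only for illustration
t1 = timeit.timeit(s1, setup=setup, number=n)
t2 = timeit.timeit(s2, setup=setup, number=n)

# timeit.timeit returns the total time in seconds for n runs.
print(f"concat:       {t1 / n * 1e6:.1f} us per loop")
print(f"from_records: {t2 / n * 1e6:.1f} us per loop")
```

In IPython the equivalent would be to run the setup code in a cell first and then use `%timeit` on the actual statement rather than on the strings.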
A short way to accomplish this is the following. I don't personally consider it very pythonic, since the code is a little hard to read and not terribly performant, but for small data-wrangling tasks it should do the trick:
recs = df.apply(lambda x: [{**{'id': x.id}, **d} for d in x.json_col], axis=1).sum()
df2 = pd.DataFrame.from_records(recs)
# outputs:
aa ab ac id
0 1 1.0 NaN 1
1 3 2.0 6.0 1
2 1 2.0 1.0 2
3 5 NaN NaN 2
4 3 NaN 2.0 3
The applied lambda creates a new dictionary by merging {'id': x.id} into each dictionary in the list x.json_col (where x is a row).
The resulting per-row lists are then summed. Since summing a list of lists concatenates them into one big list, recs has the following form:
[{'id': 1, 'aa': 1, 'ab': 1}, {'id': 1, 'aa': 3, 'ab': 2, 'ac': 6}, {'id': 2, 'aa': 1, 'ab': 2, 'ac': 1}, {'id': 2, 'aa': 5}, {'id': 3, 'aa': 3, 'ac': 2}]
A new data frame is then simply constructed from the records.
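The same merge-then-flatten idea can be sketched without `apply` and `.sum()`, which may be easier to read. This is only an illustration of the steps described above, not the answer's own code: the names `per_row` and `row_id` are made up here, and `itertools.chain.from_iterable` is swapped in for the summing step.

```python
from itertools import chain

import pandas as pd

df = pd.DataFrame({'id': [1, 2, 3],
                   'json_col': [[{'aa': 1, 'ab': 1}, {'aa': 3, 'ab': 2, 'ac': 6}],
                                [{'aa': 1, 'ab': 2, 'ac': 1}, {'aa': 5}],
                                [{'aa': 3, 'ac': 2}]]})

# Step 1: for every row, merge {'id': ...} into each dict of its json_col list.
per_row = [
    [{'id': row_id, **d} for d in dicts]
    for row_id, dicts in zip(df['id'], df['json_col'])
]

# Step 2: flatten the list of lists into one list of records
# (this is the role .sum() plays in the one-liner).
recs = list(chain.from_iterable(per_row))

# Step 3: build the new DataFrame; keys missing from a record become NaN.
df2 = pd.DataFrame.from_records(recs)
print(df2)
```

Flattening with `chain.from_iterable` avoids building a new intermediate list at every addition, which is what repeated list concatenation does.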
Here is one quick way: convert each of json_col's lists of dictionaries to a DataFrame and concatenate them, with some tweaks to create the id column:
In [51]: df = pd.concat(
[pd.DataFrame(j, index=[i]*len(j)) for i, j in enumerate(df['json_col'], 1)],
sort=False
)
In [52]: df.index.name = 'id'
In [53]: df.reset_index()
Out[53]:
id aa ab ac
0 1 1 1.0 NaN
1 1 3 2.0 6.0
2 2 1 2.0 1.0
3 2 5 NaN NaN
4 3 3 NaN 2.0
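Note that `enumerate(..., 1)` only reproduces the id column because the ids happen to be 1, 2, 3. A hedged alternative that carries the real id values along is sketched below. It assumes pandas 0.25 or later (for `DataFrame.explode`), and the names `exploded`, `expanded` and `result` are introduced here purely for illustration.

```python
import pandas as pd

df = pd.DataFrame({'id': [1, 2, 3],
                   'json_col': [[{'aa': 1, 'ab': 1}, {'aa': 3, 'ab': 2, 'ac': 6}],
                                [{'aa': 1, 'ab': 2, 'ac': 1}, {'aa': 5}],
                                [{'aa': 3, 'ac': 2}]]})

# explode() gives each dict its own row and repeats the matching id.
exploded = df.explode('json_col').reset_index(drop=True)

# Expand the dicts into columns; keys missing from a dict become NaN.
expanded = pd.DataFrame(exploded['json_col'].tolist())

# Put the id column back next to the expanded columns (indexes line up 0..4).
result = pd.concat([exploded[['id']], expanded], axis=1)
print(result)
```

Because the id is taken from the original column rather than from the row position, this also works when ids are not consecutive integers starting at 1.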