Pandas column containing a list of objects: split the column by key name and store the values as comma-separated strings
I have a DataFrame that contains this column:
A
[{"A": 28, "B": "abc"},{"A": 29, "B": "def"},{"A": 30, "B": "hij"}]
[{"A": 31, "B": "hij"},{"A": 32, "B": "abc"}]
[{"A": 28, "B": "abc"}]
[{"A": 28, "B": "abc"},{"A": 29, "B": "def"},{"A": 30, "B": "hij"}]
[{"A": 28, "B": "abc"},{"A": 29, "B": "klm"},{"A": 30, "B": "nop"}]
[{"A": 28, "B": "abc"},{"A": 29, "B": "xyz"}]
The output should be:
A B
28,29,30 abc,def,hij
31,32 hij,abc
28 abc
28,29,30 abc,def,hij
28,29,30 abc,klm,nop
28,29 abc,xyz
How do I split the list of objects into columns by key name and store the values as comma-separated strings, as shown above?
By using stack, then groupby:
df.A.apply(pd.Series).stack().\
apply(pd.Series).groupby(level=0).\
agg(lambda x :','.join(x.astype(str)))
Out[457]:
A B
0 28,29,30 abc,def,hij
1 31,32 hij,abc
2 28 abc
3 28,29,30 abc,def,hij
4 28,29,30 abc,klm,nop
Data input:
df=pd.DataFrame({'A':[[{"A": 28, "B": "abc"},{"A": 29, "B": "def"},{"A": 30, "B": "hij"}],
[{"A": 31, "B": "hij"},{"A": 32, "B": "abc"}],
[{"A": 28, "B": "abc"}],
[{"A": 28, "B": "abc"},{"A": 29, "B": "def"},{"A": 30, "B": "hij"}],
[{"A": 28, "B": "abc"},{"A": 29, "B": "klm"},{"A": 30, "B": "nop"}]]})
For your additional question about reading from a CSV file:
import ast
df=pd.read_csv(r'your.csv',dtype={'A':object})
df['A'] = df['A'].apply(ast.literal_eval)
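To make the CSV step concrete, here is a minimal sketch of the round trip. The CSV text is made up for illustration (it is not the asker's file), and it assumes each cell of A was written out as the Python repr of a list of dicts.

```python
import ast
import io

import pandas as pd

# Hypothetical CSV content: each cell of A is a quoted string holding a list of dicts.
csv_text = '''A
"[{'A': 28, 'B': 'abc'}, {'A': 29, 'B': 'def'}]"
"[{'A': 31, 'B': 'hij'}]"
'''

# After read_csv the cells are plain strings, not lists.
raw = pd.read_csv(io.StringIO(csv_text), dtype={'A': object})

# literal_eval turns each string back into a real list of dicts.
parsed = raw['A'].apply(ast.literal_eval)

print(type(raw.loc[0, 'A']).__name__)     # str
print(type(parsed.loc[0]).__name__)       # list
```

Once the column holds real lists again, any of the approaches in this thread can be applied to it.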
I was assuming A was a list of lists of dicts:
A = [
[{"A": 28, "B": "abc"},{"A": 29, "B": "def"},{"A": 30, "B": "hij"}],
[{"A": 31, "B": "hij"},{"A": 32, "B": "abc"}],
[{"A": 28, "B": "abc"}],
[{"A": 28, "B": "abc"},{"A": 29, "B": "def"},{"A": 30, "B": "hij"}],
[{"A": 28, "B": "abc"},{"A": 29, "B": "klm"},{"A": 30, "B": "nop"}],
[{"A": 28, "B": "abc"},{"A": 29, "B": "xyz"}]
]
The first thing I'd do is use comprehensions to create a new dictionary, then apply ','.join within a groupby:
B = {
(i, j, k): v
for j, row in enumerate(A)
for i, d in enumerate(row)
for k, v in d.items()
}
pd.Series(B).astype(str).groupby(level=[1, 2]).apply(','.join).unstack()
A B
0 28,29,30 abc,def,hij
1 31,32 hij,abc
2 28 abc
3 28,29,30 abc,def,hij
4 28,29,30 abc,klm,nop
5 28,29 abc,xyz
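The key layout is what makes level=[1, 2] work, so here is a small sketch on a two-row input (the shortened data is an assumption for illustration) that exposes the intermediate Series before grouping.

```python
import pandas as pd

# Shortened sample data, for illustration only.
A = [
    [{"A": 28, "B": "abc"}, {"A": 29, "B": "def"}],
    [{"A": 31, "B": "hij"}],
]

# Each key is (position inside the list, row number, dict key).
B = {
    (i, j, k): v
    for j, row in enumerate(A)
    for i, d in enumerate(row)
    for k, v in d.items()
}

# The tuple keys become a three-level MultiIndex.
s = pd.Series(B).astype(str)

# Grouping on levels 1 and 2 (row number, dict key) collects every list
# position for that row and key; unstack moves the dict key into columns.
result = s.groupby(level=[1, 2]).apply(','.join).unstack()
print(result)
```

Level 0 (the position inside each list) is deliberately left out of the grouping, so it only determines the order in which values are joined.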
Thought I'd take a shot at this. First, never use eval where you can avoid it. A better solution would be using ast:
import ast
df.A = df.A.apply(ast.literal_eval)
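As a quick illustration of why ast.literal_eval is the safer choice, the sketch below parses a literal and then feeds it an expression that eval would happily execute. The sample strings are made up for the demonstration.

```python
import ast

# A cell as it might appear after reading the data as text.
cell = '[{"A": 28, "B": "abc"}, {"A": 29, "B": "def"}]'
parsed = ast.literal_eval(cell)  # only Python literals are accepted

# A function call is not a literal, so literal_eval refuses it
# instead of running it the way eval would.
try:
    ast.literal_eval('__import__("os").getcwd()')
    rejected = False
except ValueError:
    rejected = True

print(parsed[1]["A"], rejected)
```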
Next, flatten your column:
import numpy as np
i = df.A.str.len().cumsum() # cumulative list lengths; we'll need this later
df = pd.DataFrame.from_dict(np.concatenate(df.A).tolist()) # one row per dict
df.A = df.A.astype(str) # strings, so ','.join works later
df
A B
0 28 abc
1 29 def
2 30 hij
3 31 hij
4 32 abc
5 28 abc
6 28 abc
7 29 def
8 30 hij
9 28 abc
10 29 klm
11 30 nop
12 28 abc
13 29 xyz
Now, perform a groupby using intervals made from i.
idx = pd.cut(df.index, bins=np.append([0], i), include_lowest=True, right=False)
df = df.groupby(idx, as_index=False).agg(','.join)
df
A B
0 28,29,30 abc,def,hij
1 31,32 hij,abc
2 28 abc
3 28,29,30 abc,def,hij
4 28,29,30 abc,klm,nop
5 28,29 abc,xyz
Had a little help from Bharath here.
A cool alternative to the IntervalIndex (proposed by Wen) involves the use of np.put:
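To see how the bins built from the cumulative lengths assign flattened rows back to their original rows, here is a small sketch. The lengths are made up, and it uses labels=False to return plain integer group codes, which is a simplification of the interval-based grouping above.

```python
import numpy as np
import pandas as pd

# Hypothetical list lengths for three original rows.
lengths = pd.Series([3, 2, 1])
i = lengths.cumsum()             # 3, 5, 6: end position of each original row
bins = np.append([0], i)         # edges 0, 3, 5, 6

# Positions of the flattened rows; right=False makes the bins [0, 3), [3, 5), [5, 6).
positions = np.arange(i.iloc[-1])
codes = pd.cut(positions, bins=bins, right=False, labels=False)

print(codes.tolist())            # [0, 0, 0, 1, 1, 2]
```

Each code is the number of the original row that a flattened row came from, which is exactly the grouping key the groupby needs.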
i = df.A.str.len().cumsum()
df = pd.DataFrame.from_dict(np.concatenate(df.A).tolist())
df.A = df.A.astype(str)
v = pd.Series(0, index=df.index)
np.put(v, i-1, [1] * len(i))
df = df.groupby(v[::-1].cumsum()).agg(','.join)[::-1].reset_index(drop=True)
df
A B
0 28,29,30 abc,def,hij
1 31,32 hij,abc
2 28 abc
3 28,29,30 abc,def,hij
4 28,29,30 abc,klm,nop
5 28,29 abc,xyz
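The marker trick can be traced on plain NumPy arrays. This sketch uses made-up lengths and works on an ndarray rather than a Series, so it only illustrates the idea behind the snippet above.

```python
import numpy as np

# Hypothetical list lengths for three original rows.
lengths = np.array([3, 2, 1])
ends = lengths.cumsum()                   # [3, 5, 6]

# Put a 1 at the last flattened position of each original row.
markers = np.zeros(ends[-1], dtype=int)
np.put(markers, ends - 1, 1)              # [0, 0, 1, 0, 1, 1]

# A cumulative sum taken from the end gives one constant label per original row.
labels = markers[::-1].cumsum()[::-1]     # [3, 3, 3, 2, 2, 1]

print(markers.tolist(), labels.tolist())
```

The labels decrease from the first row to the last, which is why the grouped result above is reversed with [::-1] before the index is reset.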
Timings, with the data repeated 1000 times:
df = pd.concat([df] * 1000, ignore_index=True)
%%timeit
df.A.apply(pd.Series).stack().\
apply(pd.Series).groupby(level=0).\
agg(lambda x :','.join(x.astype(str)))
1 loop, best of 3: 8.76 s per loop
%%timeit
A = df.A.values.tolist()
B = {
(i, j, k): v
for j, row in enumerate(A)
for i, d in enumerate(row)
for k, v in d.items()
}
pd.Series(B).astype(str).groupby(level=[1, 2]).apply(','.join).unstack()
1 loop, best of 3: 2.08 s per loop
%%timeit
i = df.A.str.len().cumsum()
df2 = pd.DataFrame.from_dict(np.concatenate(df.A).tolist())
df2.A = df2.A.astype(str)
idx = pd.cut(df2.index, bins=np.append([0], i), include_lowest=True, right=False)
df2.groupby(idx, as_index=False).agg(','.join)
1 loop, best of 3: 810 ms per loop
%%timeit
i = df.A.str.len().cumsum()
df2 = pd.DataFrame.from_dict(np.concatenate(df.A).tolist())
df2.A = df2.A.astype(str)
v = pd.Series(0, index=df2.index)
np.put(v, i-1, [1] * len(i))
df2.groupby(v[::-1].cumsum()).agg(','.join)[::-1].reset_index(drop=True)
1 loop, best of 3: 548 ms per loop