How to groupby by a column and return the values of other columns as lists in pandas?
I am having trouble grouping the values of one column together while retaining the corresponding values of the other columns. I would like to do something similar to this: grouping rows in list in pandas groupby
But instead, I want the list/dictionary (preferably the latter) to contain the values of multiple columns. Example for this dataframe:
df:
Col1 Col2 Col3
A xyz 1
A pqr 2
B xyz 2
B pqr 3
B lmn 1
C pqr 2
I want something like:
A {'xyz':1, 'pqr': 2}
B {'xyz':2, 'pqr': 3, 'lmn': 1}
C {'pqr':2}
I tried doing
df.groupby('Col1')[['Col2', 'Col3']].apply(list)
which is a variant of the solution mentioned in the linked post, but it isn't giving me the result I need.
From that point on, I would also like to transform it into a dataframe of the form:
xyz pqr lmn
A 1 2 NaN
B 2 3 1
C NaN 2 NaN
What you want in the end is a pivot table.
df.pivot_table(index='Col1',columns='Col2',values='Col3')
Look up the documentation for more options.
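Putting both parts of the question together, a minimal sketch reproducing the example data: the nested-dict form asked for first can be built with a plain groupby and a dict comprehension, and the wide form with pivot_table.

```python
import numpy as np
import pandas as pd

# Example data from the question
df = pd.DataFrame({
    'Col1': ['A', 'A', 'B', 'B', 'B', 'C'],
    'Col2': ['xyz', 'pqr', 'xyz', 'pqr', 'lmn', 'pqr'],
    'Col3': [1, 2, 2, 3, 1, 2],
})

# Nested dict: one {Col2: Col3} mapping per Col1 group
d = {k: dict(zip(g['Col2'], g['Col3'])) for k, g in df.groupby('Col1')}
# d['B'] == {'xyz': 2, 'pqr': 3, 'lmn': 1}

# The wide dataframe via pivot_table; missing combinations become NaN
wide = df.pivot_table(index='Col1', columns='Col2', values='Col3')
```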
Use pivot or unstack:
df = df.pivot(index='Col1',columns='Col2',values='Col3')
print (df)
Col2 lmn pqr xyz
Col1
A NaN 2.0 1.0
B 1.0 3.0 2.0
C NaN 2.0 NaN
df = df.set_index(['Col1','Col2'])['Col3'].unstack()
print (df)
Col2 lmn pqr xyz
Col1
A NaN 2.0 1.0
B 1.0 3.0 2.0
C NaN 2.0 NaN
But if you get:
ValueError: Index contains duplicate entries, cannot reshape
it means there are duplicates. You then need pivot_table, or aggregate with groupby by mean (which can be changed to sum or median), and finally reshape with unstack:
print (df)
Col1 Col2 Col3
0 A xyz 1 <-same A, xyz
1 A xyz 5 <-same A, xyz
2 A pqr 2
3 B xyz 2
4 B pqr 3
5 B lmn 1
6 C pqr 2
df = df.groupby(['Col1','Col2'])['Col3'].mean().unstack()
print (df)
Col2 lmn pqr xyz
Col1
A NaN 2.0 3.0 (1+5)/2 = 3
B 1.0 3.0 2.0
C NaN 2.0 NaN
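The same aggregated result can come from pivot_table directly, since it aggregates duplicates for you; a sketch using the duplicated example data above, with the aggfunc spelled out:

```python
import numpy as np
import pandas as pd

# Example data with duplicated (A, xyz) rows
df = pd.DataFrame({
    'Col1': ['A', 'A', 'A', 'B', 'B', 'B', 'C'],
    'Col2': ['xyz', 'xyz', 'pqr', 'xyz', 'pqr', 'lmn', 'pqr'],
    'Col3': [1, 5, 2, 2, 3, 1, 2],
})

# aggfunc='mean' is the default; swap in 'sum' or 'median' as needed
out = df.pivot_table(index='Col1', columns='Col2', values='Col3', aggfunc='mean')
# out.loc['A', 'xyz'] == 3.0, matching the groupby/unstack result
```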
EDIT:
To check all rows duplicated by Col1 and Col2:
print (df[df.duplicated(subset=['Col1','Col2'], keep=False)])
Col1 Col2 Col3
0 A xyz 1
1 A xyz 5
EDIT1:
If you need only the first row of each set of duplicates:
df = df.groupby(['Col1','Col2'])['Col3'].first().unstack()
print (df)
Col2 lmn pqr xyz
Col1
A NaN 2.0 1.0
B 1.0 3.0 2.0
C NaN 2.0 NaN
Or better, first remove duplicates with drop_duplicates and then use the first or second solution:
df = df.drop_duplicates(subset=['Col1','Col2'])
df = df.pivot(index='Col1',columns='Col2',values='Col3')
print (df)
Col2 lmn pqr xyz
Col1
A NaN 2.0 1.0
B 1.0 3.0 2.0
C NaN 2.0 NaN
Neither one of these is a pandas-only solution. I provided them because I find exploring alternatives fun. The bincount-based solution is very fast but less transparent.
Creative Solution 1
collections.defaultdict and a dictionary comprehension
from collections import defaultdict
d = defaultdict(dict)
[d[c2].setdefault(c1, c3) for i, c1, c2, c3 in df.itertuples()];
pd.DataFrame(d)
lmn pqr xyz
A NaN 2 1.0
B 1.0 3 2.0
C NaN 2 NaN
Creative Solution 2
pd.factorize and np.bincount
f1, u1 = pd.factorize(df.Col1.values)
f2, u2 = pd.factorize(df.Col2.values)
w = df.Col3.values
n, m = u1.size, u2.size
v = np.bincount(f1 * n + f2, w, n * m).reshape(n, m)
pd.DataFrame(np.ma.array(v, mask=v == 0), u1, u2)
lmn pqr xyz
A NaN 2 1.0
B 1.0 3 2.0
C NaN 2 NaN
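One caveat worth noting (my addition, not part of the original answer): masking on v == 0 would also hide a legitimate Col3 value of 0. A variant that assigns into a NaN-filled array with the same factorized indices avoids that ambiguity:

```python
import numpy as np
import pandas as pd

# Example data from the question
df = pd.DataFrame({
    'Col1': ['A', 'A', 'B', 'B', 'B', 'C'],
    'Col2': ['xyz', 'pqr', 'xyz', 'pqr', 'lmn', 'pqr'],
    'Col3': [1, 2, 2, 3, 1, 2],
})

f1, u1 = pd.factorize(df.Col1.values)
f2, u2 = pd.factorize(df.Col2.values)
n, m = u1.size, u2.size

# Fill with NaN first, then scatter the values with fancy indexing,
# so a genuine Col3 value of 0 is not confused with "missing"
v = np.full((n, m), np.nan)
v[f1, f2] = df.Col3.values
out = pd.DataFrame(v, index=u1, columns=u2)
```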
Timing
%timeit df.pivot(index='Col1',columns='Col2',values='Col3')
%timeit df.set_index(['Col1','Col2'])['Col3'].unstack()
%timeit df.groupby(['Col1','Col2'])['Col3'].mean().unstack()
%timeit df.pivot_table(index='Col1',columns='Col2',values='Col3')
%%timeit
d = defaultdict(dict)
[d[c2].setdefault(c1, c3) for i, c1, c2, c3 in df.itertuples()];
pd.DataFrame(d)
%%timeit
f1, u1 = pd.factorize(df.Col1.values)
f2, u2 = pd.factorize(df.Col2.values)
w = df.Col3.values
n, m = u1.size, u2.size
v = np.bincount(f1 * n + f2, w, n * m).reshape(n, m)
pd.DataFrame(np.ma.array(v, mask=v == 0), u1, u2)
small data
1000 loops, best of 3: 1.11 ms per loop
1000 loops, best of 3: 1.67 ms per loop
1000 loops, best of 3: 1.51 ms per loop
100 loops, best of 3: 4.17 ms per loop
1000 loops, best of 3: 1.18 ms per loop
1000 loops, best of 3: 420 µs per loop
medium data
from string import ascii_letters
l = list(ascii_letters)
df = pd.DataFrame(dict(
Col1=np.random.choice(l, 10000),
Col2=np.random.choice(l, 10000),
Col3=np.random.randint(10, size=10000)
)).drop_duplicates(['Col1', 'Col2'])
1000 loops, best of 3: 1.75 ms per loop
100 loops, best of 3: 2.17 ms per loop
100 loops, best of 3: 2.2 ms per loop
100 loops, best of 3: 4.89 ms per loop
100 loops, best of 3: 5.6 ms per loop
1000 loops, best of 3: 549 µs per loop