
How to group by a column and return the values of other columns as lists in pandas?

I am having trouble combining the values of one column while retaining the corresponding values of the other columns. I would like to do something similar to this: grouping rows in list in pandas groupby

But instead, I want the list/dictionary (preferably the latter) to contain values of multiple columns. Example for this dataframe:

df:

Col1   Col2   Col3
A      xyz     1
A      pqr     2
B      xyz     2
B      pqr     3
B      lmn     1
C      pqr     2

I want something like:

A {'xyz':1, 'pqr': 2}
B {'xyz':2, 'pqr': 3, 'lmn': 1}
C {'pqr':2}

I tried:

df.groupby('Col1')[['Col2', 'Col3']].apply(list) 

which is a variant of the solution mentioned in the linked post, but it isn't giving me the result I need.

From that point on, I would also like to transform it into a dataframe of the form:

  xyz  pqr  lmn
A  1    2    NaN
B  2    3    1
C  NaN  2    NaN

What you want in the end is a pivot table.

df.pivot_table(index='Col1',columns='Col2',values='Col3')

Look up the documentation for more options.
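The intermediate step the question asks for (one dictionary per Col1 value) can also be built directly. The following is a minimal sketch, assuming the sample frame from the question; the names per_group and wide are illustrative and not part of the original answers.

```python
import pandas as pd

# Sample frame from the question.
df = pd.DataFrame({
    "Col1": ["A", "A", "B", "B", "B", "C"],
    "Col2": ["xyz", "pqr", "xyz", "pqr", "lmn", "pqr"],
    "Col3": [1, 2, 2, 3, 1, 2],
})

# One dict per Col1 group, mapping Col2 -> Col3.
per_group = {
    key: dict(zip(group["Col2"], group["Col3"]))
    for key, group in df.groupby("Col1")
}
print(per_group)

# Expanding the nested dict yields the same wide shape as the pivot table;
# combinations that never occur become NaN.
wide = pd.DataFrame.from_dict(per_group, orient="index")
print(wide)
```

If a (Col1, Col2) pair occurs more than once, the dict keeps only the last value seen for it, so this sketch suits data without duplicates.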

Use pivot or unstack:

df = df.pivot(index='Col1',columns='Col2',values='Col3')
print (df)
Col2  lmn  pqr  xyz
Col1               
A     NaN  2.0  1.0
B     1.0  3.0  2.0
C     NaN  2.0  NaN

df = df.set_index(['Col1','Col2'])['Col3'].unstack()
print (df)

Col2  lmn  pqr  xyz
Col1               
A     NaN  2.0  1.0
B     1.0  3.0  2.0
C     NaN  2.0  NaN

but if you get:

ValueError: Index contains duplicate entries, cannot reshape

it means some Col1/Col2 pairs are duplicated. In that case use pivot_table, or aggregate with groupby using mean (which can be changed to sum or median) and then reshape with unstack:

print (df)
  Col1 Col2  Col3
0    A  xyz     1 <-same A, xyz
1    A  xyz     5 <-same A, xyz
2    A  pqr     2
3    B  xyz     2
4    B  pqr     3
5    B  lmn     1
6    C  pqr     2

df = df.groupby(['Col1','Col2'])['Col3'].mean().unstack()
print (df)
Col2  lmn  pqr  xyz
Col1               
A     NaN  2.0  3.0 (1+5)/2 = 3
B     1.0  3.0  2.0
C     NaN  2.0  NaN
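To see both behaviours side by side, here is a small sketch that assumes the duplicated sample frame above. It triggers the ValueError from pivot and then aggregates with pivot_table using sum instead of the default mean. The names pivot_failed and summed are illustrative.

```python
import pandas as pd

# Sample frame with a duplicated (A, xyz) pair.
df = pd.DataFrame({
    "Col1": ["A", "A", "A", "B", "B", "B", "C"],
    "Col2": ["xyz", "xyz", "pqr", "xyz", "pqr", "lmn", "pqr"],
    "Col3": [1, 5, 2, 2, 3, 1, 2],
})

# pivot refuses to reshape because (A, xyz) occurs twice.
try:
    df.pivot(index="Col1", columns="Col2", values="Col3")
    pivot_failed = False
except ValueError as err:
    pivot_failed = True
    print(err)

# pivot_table aggregates the duplicates; aggfunc selects the rule.
summed = df.pivot_table(index="Col1", columns="Col2", values="Col3", aggfunc="sum")
print(summed)
```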

EDIT:

To check all duplicates by Col1 and Col2:

print (df[df.duplicated(subset=['Col1','Col2'], keep=False)])
  Col1 Col2  Col3
0    A  xyz     1
1    A  xyz     5

EDIT1:

If you need only the first row of each set of duplicates:

df = df.groupby(['Col1','Col2'])['Col3'].first().unstack()
print (df)
Col2  lmn  pqr  xyz
Col1               
A     NaN  2.0  1.0
B     1.0  3.0  2.0
C     NaN  2.0  NaN

Or better, first remove duplicates with drop_duplicates and then use the first or second solution:

df = df.drop_duplicates(subset=['Col1','Col2'])
df = df.pivot(index='Col1',columns='Col2',values='Col3')
print (df)
Col2  lmn  pqr  xyz
Col1               
A     NaN  2.0  1.0
B     1.0  3.0  2.0
C     NaN  2.0  NaN

Neither of these is a pandas-only solution. I provided them because I find exploring alternatives fun. The bincount-based solution is very fast but less transparent.

Creative Solution 1
collections.defaultdict and itertuples

from collections import defaultdict
import pandas as pd

# Outer keys are Col2 values (columns), inner keys are Col1 values (rows)
d = defaultdict(dict)
for _, c1, c2, c3 in df.itertuples():
    # setdefault keeps the first value seen for a repeated (c1, c2) pair
    d[c2].setdefault(c1, c3)
pd.DataFrame(d)

   lmn  pqr  xyz
A  NaN    2  1.0
B  1.0    3  2.0
C  NaN    2  NaN

Creative Solution 2
pd.factorize and np.bincount

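import numpy as np
import pandas as pd

f1, u1 = pd.factorize(df.Col1.values)
f2, u2 = pd.factorize(df.Col2.values)
w = df.Col3.values

n, m = u1.size, u2.size

# Row-major flat index into an n x m grid: row code * number of columns + column code
v = np.bincount(f1 * m + f2, w, n * m).reshape(n, m)
# Note: masking v == 0 also hides cells whose true sum is 0
pd.DataFrame(np.ma.array(v, mask=v == 0), u1, u2)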

   lmn  pqr  xyz
A  NaN    2  1.0
B  1.0    3  2.0
C  NaN    2  NaN
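Since the bincount approach is the less transparent one, here is a step-by-step sketch of the same idea, assuming the sample frame from the question. It differs from the answer in one labeled way: instead of masking zeros, it uses a second bincount (counts, an illustrative name) to tell empty cells apart from genuine zero sums.

```python
import numpy as np
import pandas as pd

# Sample frame from the question.
df = pd.DataFrame({
    "Col1": ["A", "A", "B", "B", "B", "C"],
    "Col2": ["xyz", "pqr", "xyz", "pqr", "lmn", "pqr"],
    "Col3": [1, 2, 2, 3, 1, 2],
})

# factorize maps each label to an integer code and returns the unique labels.
rows, row_labels = pd.factorize(df["Col1"])
cols, col_labels = pd.factorize(df["Col2"])
n, m = len(row_labels), len(col_labels)

# Position of each (row, col) pair in a flattened, row-major n x m grid.
flat = rows * m + cols

# Weighted bincount sums Col3 per cell; the unweighted one counts rows per cell.
totals = np.bincount(flat, weights=df["Col3"].to_numpy(), minlength=n * m)
counts = np.bincount(flat, minlength=n * m)

# Cells that received no rows become NaN; genuine zeros stay zeros.
grid = np.where(counts > 0, totals, np.nan).reshape(n, m)
wide = pd.DataFrame(grid, index=row_labels, columns=col_labels)
print(wide)
```

Because bincount adds up weights, duplicated (Col1, Col2) pairs are summed here rather than raising an error.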

Timing

%timeit df.pivot(index='Col1',columns='Col2',values='Col3')
%timeit df.set_index(['Col1','Col2'])['Col3'].unstack()
%timeit df.groupby(['Col1','Col2'])['Col3'].mean().unstack()
%timeit df.pivot_table(index='Col1',columns='Col2',values='Col3')

%%timeit
d = defaultdict(dict)
[d[c2].setdefault(c1, c3) for i, c1, c2, c3 in df.itertuples()];
pd.DataFrame(d)

%%timeit
f1, u1 = pd.factorize(df.Col1.values)
f2, u2 = pd.factorize(df.Col2.values)
w = df.Col3.values

n, m = u1.size, u2.size

v = np.bincount(f1 * m + f2, w, n * m).reshape(n, m)
pd.DataFrame(np.ma.array(v, mask=v == 0), u1, u2)

small data

1000 loops, best of 3: 1.11 ms per loop
1000 loops, best of 3: 1.67 ms per loop
1000 loops, best of 3: 1.51 ms per loop
100 loops, best of 3: 4.17 ms per loop

1000 loops, best of 3: 1.18 ms per loop

1000 loops, best of 3: 420 µs per loop

medium data

from string import ascii_letters
l = list(ascii_letters)
df = pd.DataFrame(dict(
        Col1=np.random.choice(l, 10000),
        Col2=np.random.choice(l, 10000),
        Col3=np.random.randint(10, size=10000)
    )).drop_duplicates(['Col1', 'Col2'])

1000 loops, best of 3: 1.75 ms per loop
100 loops, best of 3: 2.17 ms per loop
100 loops, best of 3: 2.2 ms per loop
100 loops, best of 3: 4.89 ms per loop

100 loops, best of 3: 5.6 ms per loop

1000 loops, best of 3: 549 µs per loop
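%timeit and %%timeit are IPython magics. Outside IPython, a comparable measurement can be sketched with the standard-library timeit module. The loop counts, the candidates dict and the small sample frame below are assumptions chosen for illustration, and the numbers will differ from those reported above.

```python
import timeit

import pandas as pd

# Small sample frame from the question.
df = pd.DataFrame({
    "Col1": ["A", "A", "B", "B", "B", "C"],
    "Col2": ["xyz", "pqr", "xyz", "pqr", "lmn", "pqr"],
    "Col3": [1, 2, 2, 3, 1, 2],
})

candidates = {
    "pivot": lambda: df.pivot(index="Col1", columns="Col2", values="Col3"),
    "unstack": lambda: df.set_index(["Col1", "Col2"])["Col3"].unstack(),
    "groupby": lambda: df.groupby(["Col1", "Col2"])["Col3"].mean().unstack(),
    "pivot_table": lambda: df.pivot_table(index="Col1", columns="Col2", values="Col3"),
}

timings = {}
for name, func in candidates.items():
    # Best of 3 repeats of 100 calls each, reported per call.
    best = min(timeit.repeat(func, number=100, repeat=3)) / 100
    timings[name] = best
    print(f"{name}: {best * 1e3:.3f} ms per loop")
```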
