Groupby into list for non consecutive values
I am trying to group this dataset
col1 col2
0 A 1
1 B 1
2 C 1
3 D 3
4 E 3
5 F 2
6 G 2
7 H 1
8 I 1
9 j 2
10 K 2
into this:
1: [A, B, C]
3: [D, E]
2: [F, G]
1: [H, I]
2: [J, K]
So each consecutive run of the same value has to become its own group, rather than all occurrences of a value being grouped at once.
So far I was able to do the normal groupby,
df.groupby("col2")["col1"].apply(list)
but it isn't correct.
You need to distinguish the consecutive runs: compare the column with its shifted values for inequality, take the cumulative sum to get a run number, group by both, and finally remove the second level of the MultiIndex:
s = (df.groupby(["col2", df["col2"].ne(df["col2"].shift()).cumsum()])["col1"]
.agg(list)
.reset_index(level=1, drop=True))
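To see why this works, it can help to look at the intermediate key. Below is a minimal self-contained sketch that rebuilds the sample frame from the question; the names `starts`, `run_id` and `result` are my own, chosen only for illustration. Note that it groups by the run number alone, which is an assumption on my part that you want the groups in order of appearance, as in the expected output.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "col1": list("ABCDEFGHIjK"),
    "col2": [1, 1, 1, 3, 3, 2, 2, 1, 1, 2, 2],
})

# True wherever col2 differs from the previous row. The first row is True
# as well, because shift() puts NaN there and NaN compares as "not equal".
starts = df["col2"].ne(df["col2"].shift())

# The cumulative sum of those flags numbers the runs: 1,1,1,2,2,3,3,4,4,5,5
run_id = starts.cumsum()

# Group by the run number and pair each run's col2 value with its col1 members
result = [
    (int(g["col2"].iloc[0]), g["col1"].tolist())
    for _, g in df.groupby(run_id)
]

for value, members in result:
    print(value, ":", members)
```

Grouping by `["col2", run_id]` as in the answer gives the same lists, but groupby sorts by its keys by default, so the two runs of 1 come first, then the runs of 2, then 3. Grouping by the run number alone keeps the runs in their original order.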
Since Jezrael has already answered using pandas, I would like to add a non-pandas method.
I know this is not an efficient method, but I am including it for learning purposes.
Using itertools's groupby:
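from itertools import groupby

last_index = 0
# itertools.groupby only merges adjacent equal keys; enumerate keeps each row's position
for v, g in groupby(enumerate(df.col2), lambda k: k[1]):
    l = [*g]  # list of (position, value) pairs for this run
    # slice from the start of the run up to and including its last position
    print(df.iloc[last_index]['col2'], ':', df.iloc[last_index:l[-1][0]+1]['col1'].values)
    last_index += len(l)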
1 : ['A' 'B' 'C']
3 : ['D' 'E']
2 : ['F' 'G']
1 : ['H' 'I']
2 : ['j' 'K']
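As a variation on the same idea, the index bookkeeping can be avoided by grouping the (col1, col2) pairs directly. This is only a sketch: it uses plain lists copied from the question so that it runs without pandas, and the names `col1`, `col2` and `runs` are my own. With a DataFrame you could presumably pass `zip(df["col1"], df["col2"])` instead.

```python
from itertools import groupby
from operator import itemgetter

# Sample data copied from the question
col1 = list("ABCDEFGHIjK")
col2 = [1, 1, 1, 3, 3, 2, 2, 1, 1, 2, 2]

# itertools.groupby starts a new group whenever the key changes, so it
# naturally yields one group per consecutive run of the same col2 value.
runs = [
    (key, [name for name, _ in group])
    for key, group in groupby(zip(col1, col2), key=itemgetter(1))
]

for key, names in runs:
    print(key, ":", names)
```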