How to split a pandas dataframe into multiple parts based on consecutively occurring values in a column?

I have a dataframe, shown below in tabular form. The original dataframe is much larger, so I cannot afford to loop over each row.

col1 | col2 | col3
a      x     1
b      y     1
c      z     0
d      k     1
e      l     1

What I want is to split it into sub-dataframes, each holding one run of consecutive 1s in the column col3. So ideally the dataframe above would return two dataframes, df1 and df2:

df1

col1 | col2 | col3
a      x     1
b      y     1

df2

col1 | col2 | col3
d      k     1
e      l     1

Is there an approach like groupby to do this? If I use groupby, it returns all 4 rows with col3==1 in a single dataframe. I do not want that, as I need two dataframes, each consisting of consecutively occurring 1s. One obvious method is to loop over the rows and return a dataframe whenever I find a 0, but that is not efficient. Any kind of help is appreciated.

First compare the values with 1 to get a boolean mask, then create consecutive group ids with shift and a cumulative sum, and finally collect all groups in a list comprehension with groupby:

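# True where col3 equals 1
m1 = df['col3'].eq(1)
# The id increases by one each time the mask changes value
g = m1.ne(m1.shift()).cumsum()

# Keep only the rows with col3 == 1, one dataframe per run
dfs = [x for i, x in df[m1].groupby(g)]
print (dfs)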
[  col1 col2  col3
0    a    x     1
1    b    y     1,   col1 col2  col3
3    d    k     1
4    e    l     1]

print (dfs[0])
  col1 col2  col3
0    a    x     1
1    b    y     1
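
For reference, the steps above can be put together as a self-contained sketch. The sample frame is rebuilt from the table in the question, and the names `is_one` and `run_id` are illustrative only; they do not appear in the answer.

```python
import pandas as pd

# Sample data rebuilt from the question's table
df = pd.DataFrame({
    "col1": list("abcde"),
    "col2": list("xyzkl"),
    "col3": [1, 1, 0, 1, 1],
})

# Boolean mask: True where col3 equals 1
is_one = df["col3"].eq(1)

# Comparing the mask with its shifted copy flags the first row of each run;
# the cumulative sum of those flags gives one id per run: 1, 1, 2, 3, 3
run_id = is_one.ne(is_one.shift()).cumsum()

# Filter out the 0 rows first, then group the remaining rows by run id
dfs = [part for _, part in df[is_one].groupby(run_id)]
```

The run of 0s gets its own id (2 here), but it never shows up in the result because those rows are filtered out by `df[is_one]` before grouping.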

If runs consisting of a single 1 should also be removed, add Series.duplicated with keep=False, which keeps only the group ids that occur more than once:

print (df)
  col1 col2  col3
0    a    x     1
1    b    y     1
2    c    z     0
3    d    k     1
4    e    l     1
5    f    m     0
6    g    n     1 <- removed

m1 = df['col3'].eq(1)
g = m1.ne(m1.shift()).cumsum()

# Keep only group ids that occur more than once (runs of length >= 2)
g = g[g.duplicated(keep=False)]
print (g)
0    1
1    1
3    3
4    3
Name: col3, dtype: int32

dfs = [x for i, x in df[m1].groupby(g)]
print (dfs)
[  col1 col2  col3
0    a    x     1
1    b    y     1,   col1 col2  col3
3    d    k     1
4    e    l     1]
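
A possible generalization, sketched under assumptions: the helper `split_runs` and its `value` and `min_len` parameters are hypothetical and not part of the answer. Instead of `Series.duplicated`, it measures each run's length with `value_counts`, so the minimum run length can be chosen freely.

```python
import pandas as pd

def split_runs(df, col, value=1, min_len=1):
    """Hypothetical helper: one dataframe per run of `value` in `col`
    whose length is at least `min_len`."""
    mask = df[col].eq(value)
    run_id = mask.ne(mask.shift()).cumsum()
    # Length of the run each row belongs to
    run_len = run_id.map(run_id.value_counts())
    keep = mask & run_len.ge(min_len)
    return [part for _, part in df[keep].groupby(run_id[keep])]

# Sample data rebuilt from the printed frame above
df = pd.DataFrame({
    "col1": list("abcdefg"),
    "col2": list("xyzklmn"),
    "col3": [1, 1, 0, 1, 1, 0, 1],
})

parts = split_runs(df, "col3", min_len=2)  # drops the single-row run at index 6
```

With `min_len=1` the same call would also return the final single-row run as a third dataframe.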
