How to split a pandas DataFrame into multiple parts based on consecutively occurring values in a column?
I have a dataframe, shown below in tabular form. The original dataframe is much larger, so I cannot afford to loop over each row.
col1 | col2 | col3
a    | x    | 1
b    | y    | 1
c    | z    | 0
d    | k    | 1
e    | l    | 1
What I want is to split it into subsets of the dataframe, each made up of a consecutive run of 1s in the column col3. Ideally, the dataframe above should yield two dataframes, df1 and df2:
df1
col1 | col2 | col3
a    | x    | 1
b    | y    | 1
df2
col1 | col2 | col3
d    | k    | 1
e    | l    | 1
Is there an approach like groupby to do this? If I use groupby, it returns all 4 rows with col3==1 in a single dataframe. I do not want that, because I need two dataframes, each consisting of consecutively occurring 1s. One obvious method is to loop over the rows and return a dataframe whenever I find a 0, but that is not efficient. Any help is appreciated.
First compare the values with 1, then create labels for the consecutive groups with shift and a cumulative sum, and finally collect all groups in a list comprehension with groupby:
m1 = df['col3'].eq(1)
g = m1.ne(m1.shift()).cumsum()
dfs = [x for i, x in df[m1].groupby(g)]
print (dfs)
[ col1 col2 col3
0 a x 1
1 b y 1, col1 col2 col3
3 d k 1
4 e l 1]
print (dfs[0])
col1 col2 col3
0 a x 1
1 b y 1
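For anyone who wants to run this end to end, here is a minimal self-contained sketch. It assumes the sample frame from the question, rebuilt by hand; the intermediate names mask and run_id are only illustrative. The mask changes value exactly where a new run starts, so the cumulative sum of those change points gives every run its own label.

```python
import pandas as pd

# Sample data rebuilt by hand from the question (an assumption about
# how the original frame was constructed).
df = pd.DataFrame({
    "col1": list("abcde"),
    "col2": list("xyzkl"),
    "col3": [1, 1, 0, 1, 1],
})

# True where col3 equals 1.
mask = df["col3"].eq(1)

# A row starts a new run when its mask value differs from the previous
# row's; the cumulative sum of those change points labels each run.
run_id = mask.ne(mask.shift()).cumsum()

# Keep only the rows equal to 1 and split them by run label.
dfs = [part for _, part in df[mask].groupby(run_id[mask])]

for part in dfs:
    print(part, end="\n\n")
```

Indexing run_id with the mask before grouping is optional here, since groupby aligns the labels on the index, but it makes explicit that only the rows equal to 1 are grouped.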
If runs consisting of a single 1 also need to be removed, add Series.duplicated with keep=False:
print (df)
col1 col2 col3
0 a x 1
1 b y 1
2 c z 0
3 d k 1
4 e l 1
5 f m 0
6 g n 1 <- removed
m1 = df['col3'].eq(1)
g = m1.ne(m1.shift()).cumsum()
g = g[g.duplicated(keep=False)]
print (g)
0 1
1 1
3 3
4 3
Name: col3, dtype: int32
dfs = [x for i, x in df[m1].groupby(g)]
print (dfs)
[ col1 col2 col3
0 a x 1
1 b y 1, col1 col2 col3
3 d k 1
4 e l 1]
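If the minimum run length should be configurable rather than fixed at two, one possible variation is to filter the groups by length after splitting. This is only a sketch: the helper name split_runs and its parameters value and min_len are made up for illustration and are not part of pandas.

```python
import pandas as pd

def split_runs(df, column, value=1, min_len=1):
    """Split df into one frame per consecutive run of `value` in `column`.

    Hypothetical helper: runs shorter than `min_len` rows are dropped.
    """
    mask = df[column].eq(value)
    # Label runs: the label increases whenever the mask changes value.
    run_id = mask.ne(mask.shift()).cumsum()
    parts = [part for _, part in df[mask].groupby(run_id[mask])]
    return [part for part in parts if len(part) >= min_len]

# Sample data rebuilt by hand from the second example above.
df = pd.DataFrame({
    "col1": list("abcdefg"),
    "col2": list("xyzklmn"),
    "col3": [1, 1, 0, 1, 1, 0, 1],
})

all_runs = split_runs(df, "col3")              # includes the single-row run
long_runs = split_runs(df, "col3", min_len=2)  # drops the single-row run
print(len(all_runs), len(long_runs))
```

With min_len=2 this should give the same two frames as the duplicated approach, while larger thresholds need no further changes.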