
Pandas truncate DataFrame after a column condition is met

So I have the following DataFrame df:

[image of the DataFrame df, not preserved]

The frame contains two groups of data, each sorted within its own group.

Group 1 is from index 359 to 365 inclusive

Group 2 is from index 366 to 371 inclusive

I want to separate them into the two groups. There may be more than two groups. The logic I am applying is that whenever the next STEPS_ID is less than the current STEPS_ID, the current row marks the end of its group.

I am easily able to get this pointer by df.STEPS_ID <= df.STEPS_ID.shift(-1)
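To make the boundary condition concrete, here is a minimal sketch on made-up data (the values below are assumptions, not the frame from the question). Note that the `<=` comparison is True inside a group, so the end-of-group rows are the ones where it is False; comparing with `>` flags them directly.

```python
import pandas as pd

# Hypothetical data: two sorted runs of STEPS_ID, back to back.
df = pd.DataFrame({'STEPS_ID': [1107, 1108, 1109, 1107, 1108]})

# True where the next STEPS_ID is smaller than the current one,
# i.e. on the last row of every group except the final group.
# The last row is compared against NaN, which evaluates to False.
is_group_end = df['STEPS_ID'] > df['STEPS_ID'].shift(-1)

print(is_group_end.tolist())
```

With this data only the third row (the last 1109) is flagged, because it is the only row followed by a smaller value.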

Is there an elegant pandas way to achieve this easily, possibly using vectorized operations rather than a for loop?

This seems to be a common enough problem that I am sure there must be a well-defined algorithm to solve these kinds of problems. I would also appreciate it if you guys could guide me in reading up on the theoretical basis for such algorithms.

There is more than one way to "separate things into groups". One way would be to make a list of groups. But that is not the ideal way when dealing with a Pandas DataFrame. Once you have a list, you are forced to iterate over it in a Python loop, and Python loops are slow compared to native Pandas operations.

Assuming you have enough memory, a better way would be to add a column or index to the DataFrame:

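import pandas as pd

# Two consecutive sorted runs of STEPS_ID (list() is needed on Python 3)
df = pd.DataFrame({'STEPS_ID': list(range(1107, 1113)) * 2})
# A new group starts wherever STEPS_ID drops below the previous row's value;
# the cumulative sum of those flags gives each row its group number
df['GROUP'] = (df['STEPS_ID'] < df['STEPS_ID'].shift(1)).astype('int').cumsum()
# df.set_index('GROUP', inplace=True, append=True)
print(df)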

yields

    STEPS_ID  GROUP
0       1107      0
1       1108      0
2       1109      0
3       1110      0
4       1111      0
5       1112      0
6       1107      1
7       1108      1
8       1109      1
9       1110      1
10      1111      1
11      1112      1
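To see why the cumulative sum produces group labels, it may help to look at the intermediate steps. The sketch below uses a shorter, made-up frame, and the column names `PREV` and `IS_START` are introduced only for illustration.

```python
import pandas as pd

# Hypothetical data: two sorted runs of STEPS_ID.
df = pd.DataFrame({'STEPS_ID': [1107, 1108, 1109, 1107, 1108]})

# Previous row's value; the first row has no predecessor, so it is NaN.
df['PREV'] = df['STEPS_ID'].shift(1)

# True on the first row of each new group (the value dropped below the
# previous one). Comparisons against NaN are False, so row 0 is not flagged.
df['IS_START'] = df['STEPS_ID'] < df['PREV']

# Each True adds one to the running total, so every row in the same
# run ends up with the same integer label.
df['GROUP'] = df['IS_START'].astype('int').cumsum()

print(df)
```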

Now you can do aggregation/transformation operations on each group by calling

df.groupby('GROUP')....
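As a rough illustration of what could follow the `groupby` call, the sketch below shows one aggregation, one transformation, and, if separate frames are really needed, a split into per-group pieces. The names `summary`, `POS_IN_GROUP`, and `pieces` are made up for this example, and the operations chosen are only a few of many possibilities.

```python
import pandas as pd

df = pd.DataFrame({'STEPS_ID': list(range(1107, 1113)) * 2})
df['GROUP'] = (df['STEPS_ID'] < df['STEPS_ID'].shift(1)).astype('int').cumsum()

# Aggregation: one summary row per group.
summary = df.groupby('GROUP')['STEPS_ID'].agg(['min', 'max', 'count'])
print(summary)

# Transformation: a result aligned with the original rows,
# here the position of each row within its group.
df['POS_IN_GROUP'] = df.groupby('GROUP').cumcount()

# Iterating over the groupby object yields (label, sub-DataFrame) pairs,
# which can be collected if separate frames are required.
pieces = {key: group for key, group in df.groupby('GROUP')}
print(pieces[1])
```

Collecting the pieces brings back the Python-level loop, so it is probably best kept for cases where the vectorized groupby operations are not enough.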
