
Pandas: splitting data frame based on the slope of data

I have this data frame:

x = pd.DataFrame({'entity':[5,7,5,5,5,6,3,2,0,5]})


Update: I want a function that, if the slope is negative and the length of the group is more than 2, returns True together with the start and end index of the group. For this case it should return: result = True, index = 5, index = 8

1- I want to split the data frame based on the slope. This example should have 6 groups.

2- How can I check the length of the groups?


I tried to get the groups with the code below, but I don't know how to split the data frame or how to check the length of each part.

New update: Thanks to Matt W. for his code. I finally found the solution.

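import numpy as np
import pandas as pd

df = pd.DataFrame({'entity':[5,7,5,5,5,6,3,2,0,5]})
df['diff'] = df.entity.diff().fillna(0)
# Collapse every negative difference to -1 so consecutive drops share one value
df.loc[df['diff'] < 0, 'diff'] = -1

# Start a new group whenever the difference differs from the previous row
init = [0]
for x in df['diff'] == df['diff'].shift(1):
    if x:
        init.append(init[-1])
    else:
        init.append(init[-1]+1)
df['g'] = init[1:]

def get_slope(df):
    # Least-squares slope of the first column against the index
    x=np.array(df.iloc[:,0].index)
    y=np.array(df.iloc[:,0])
    X = x - x.mean()
    Y = y - y.mean()
    slope = (X.dot(Y)) / (X.dot(X))
    return slope

df.groupby('g').apply(get_slope)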

Result:

0    NaN
1    NaN
2    NaN
3    0.0
4    NaN
5   -1.5
6    NaN
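The formula inside get_slope is the ordinary least-squares slope of the values against the index. As a rough cross-check, the sketch below applies the same formula to the three-row falling run (index 6 to 8, values 3, 2, 0) and compares it with np.polyfit. The variable names slope_manual and slope_polyfit are only illustrative. Single-row groups divide 0 by 0 in this formula, which is why they show up as NaN.

```python
import numpy as np

# Index positions and values of the falling run in the example data
x = np.array([6, 7, 8])
y = np.array([3, 2, 0])

# Same centred dot-product formula as get_slope
X = x - x.mean()
Y = y - y.mean()
slope_manual = X.dot(Y) / X.dot(X)

# Degree-1 polynomial fit; the first coefficient is the slope
slope_polyfit = np.polyfit(x, y, 1)[0]

print(slope_manual, slope_polyfit)
```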

Take the difference and bfill() the start so that the 0th element gets the same number as the next one. Then set all negative differences to the same value so that they count as the same "slope". Then shift the column to check whether each number equals the previous one, and iterate through the result to build a list that increments whenever the value changes, assigning that list to g.

df = pd.DataFrame({'entity':[5,7,5,5,5,6,3,2,0,5]})
df['diff'] = df.entity.diff().bfill()
df.loc[df['diff'] < 0, 'diff'] = -1

init = [0]
for x in df['diff'] == df['diff'].shift(1):
    if x:
        init.append(init[-1])
    else:
        init.append(init[-1]+1)
df['g'] = init[1:]
df
   entity  diff  g
0       5   2.0  1
1       7   2.0  1
2       5  -1.0  2
3       5   0.0  3
4       5   0.0  3
5       6   1.0  4
6       3  -1.0  5
7       2  -1.0  5
8       0  -1.0  5
9       5   5.0  6
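For the second part of the question (checking the length of each group), one option is groupby(...).size(). The sketch below rebuilds the same g labels without the loop and counts the rows per group. The names sign, group_sizes and long_groups are illustrative and not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({'entity': [5, 7, 5, 5, 5, 6, 3, 2, 0, 5]})

# Same labelling idea as above: collapse all negative differences to -1
diff = df['entity'].diff().bfill()
sign = diff.where(diff >= 0, -1)

# A new group starts wherever the value differs from the previous row
df['g'] = (sign != sign.shift(1)).cumsum()

# Number of rows in each group
group_sizes = df.groupby('g').size()

# Groups with more than 2 rows
long_groups = group_sizes[group_sizes > 2]

print(group_sizes)
print(long_groups)
```

Each group can also be pulled out as its own data frame with df[df['g'] == k] or by iterating over df.groupby('g').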

Just wanted to present another solution that doesn't require a for-loop:

df = pd.DataFrame({'entity':[5,7,5,5,5,6,3,2,0,5]})
df['diff'] = df.entity.diff().bfill()
df.loc[df['diff'] < 0, 'diff'] = -1
df['g'] = (~(df['diff'] == df['diff'].shift(1))).cumsum()
df
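Building on this grouping, the function requested in the question's update could be sketched roughly as follows. find_falling_run and its parameters col and min_len are hypothetical names, not something from the original answers. It returns the first and last index of the rows labelled as falling, which is 6 and 8 for the example data. The question expects 5 as the start, which is the row just before the first drop, so the start may need shifting back by one depending on which convention is wanted.

```python
import pandas as pd

def find_falling_run(df, col='entity', min_len=3):
    """Return (True, start, end) for the first falling group with at
    least min_len rows, or (False, None, None) if there is none."""
    diff = df[col].diff().bfill()
    # Collapse all negative differences to -1 so they share one label
    sign = diff.where(diff >= 0, -1)
    # A new group starts wherever the value differs from the previous row
    g = (sign != sign.shift(1)).cumsum()
    for _, part in df.groupby(g):
        first = part.index[0]
        if sign.loc[first] < 0 and len(part) >= min_len:
            return True, first, part.index[-1]
    return False, None, None

df = pd.DataFrame({'entity': [5, 7, 5, 5, 5, 6, 3, 2, 0, 5]})
result = find_falling_run(df)
print(result)
```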
