
How can I add a label in a new column in pandas based on two consecutive values of another column?

I've got a dataframe, df, with a single column, extension.
The values in the extension column cyclically increase and decrease, as shown below:

extension
0.000
0.050
0.100
0.150
0.130
0.080
0.020
0.050
0.075

I'm trying to label each increasing and decreasing cycle, as shown below:

extension lablel
0.000      1
0.050      1
0.100      1
0.150      1
0.130      1
0.080      1
0.020      1
0.050      2
0.075      2

I'm a bit stuck and would appreciate some guidance here.

# Find the difference between consecutive rows in the extension column
df['lablel'] = df.extension.diff()
# Find zero crossings in the consecutive differences, take the cumulative sum, and add 1
df['lablel'] = (df.lablel.ge(0) & df.lablel.shift(1).le(0) | df.lablel.ge(0) & df.lablel.shift(-1).le(0)).cumsum() + 1



 extension  lablel
0      0.000       1
1      0.050       1
2      0.100       1
3      0.150       2
4      0.130       2
5      0.080       2
6      0.020       2
7      0.050       3
8      0.075       3
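The one-liner above packs several steps together. As a rough sketch, it can be unpacked into named intermediate masks; the names step, rising, after_fall, before_fall and boundary are made up here for illustration and are not part of the original answer:

```python
import pandas as pd

df = pd.DataFrame(
    {"extension": [0.000, 0.050, 0.100, 0.150, 0.130, 0.080, 0.020, 0.050, 0.075]}
)

step = df["extension"].diff()        # change relative to the previous row (NaN in row 0)
rising = step.ge(0)                  # True where the value did not drop; NaN compares as False
after_fall = step.shift(1).le(0)     # the previous step was flat or falling
before_fall = step.shift(-1).le(0)   # the next step is flat or falling

# A rising row that directly follows or directly precedes a non-rising step
boundary = (rising & after_fall) | (rising & before_fall)

# Each boundary row starts a new label; + 1 makes the labels start at 1
df["lablel"] = boundary.cumsum() + 1
print(df)
```

With this reading, the label already changes on the last rising row before a fall (row 3), which is why the peak value 0.150 gets label 2.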

So let's reproduce your data:

import numpy as np
import pandas as pd

a = [0.000, 0.050, 0.100, 0.150, 0.130, 0.080, 0.020, 0.050, 0.075]
df = pd.DataFrame(a, columns=["extension"])

The short answer is this:

df["label"] = pd.Series(np.where(df["extension"].diff() < 0, 0, 1)).diff().abs().cumsum() + 1
df.at[0,"label"] = 1

At least that's my answer, though it definitely looks a bit clunky. So let's break it down step by step:

df["extension"].diff()

diff computes the difference between each cell and the previous one. It therefore cannot produce a value for the first element, which becomes NaN.

Output:

0      NaN
1    0.050
2    0.050
3    0.050
4   -0.020
5   -0.050
6   -0.060
7    0.030
8    0.025

Now let's binarize the result to detect changes between positive and negative differences, using where from numpy:

np.where(df["extension"].diff() < 0, 0, 1)

Output:

array([1, 1, 1, 1, 0, 0, 0, 1, 1])

This tells us whether the difference to the previous value is negative (--> 0) or positive (--> 1). The first element's NaN and a zero difference also map to 1, because neither is less than 0.
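A small sketch of how the comparison treats the edge cases. The values in steps are made up for illustration and are not taken from the question:

```python
import numpy as np
import pandas as pd

# Hypothetical differences: a missing value, no change, a drop, a rise
steps = pd.Series([np.nan, 0.0, -0.02, 0.03])

# Only strictly negative differences become 0; everything else becomes 1
flags = np.where(steps < 0, 0, 1)
print(flags)
```

Because NaN < 0 evaluates to False, the first row is grouped with the rising values, and a flat step (difference of exactly 0) is as well.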

Next, you only want to know when the positive/negative trend changes, so we apply the diff function once more. Before that, we have to convert the numpy array back to a pd.Series:

pd.Series(np.where(df["extension"].diff() < 0, 0, 1)).diff()

Output:

0    NaN
1    0.0
2    0.0
3    0.0
4   -1.0
5    0.0
6    0.0
7    1.0
8    0.0

Ultimately you're not interested in the direction in which the trend changed, only that it changed, so we erase this information with the abs function. Then we sum up the result with the cumsum function so that the value increases at every change:

pd.Series(np.where(df["extension"].diff() < 0, 0, 1)).diff().abs().cumsum()

Output:

0    NaN
1    0.0
2    0.0
3    0.0
4    1.0
5    1.0
6    1.0
7    2.0
8    2.0

Finally, two additions base the label at 1 rather than 0 and replace the first item, which was NaN: append + 1 to the expression and set df.at[0,"label"] = 1

And there you go:

         extension  label
    0      0.000    1.0
    1      0.050    1.0
    2      0.100    1.0
    3      0.150    1.0
    4      0.130    2.0
    5      0.080    2.0
    6      0.020    2.0
    7      0.050    3.0
    8      0.075    3.0
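For reuse, the steps can be wrapped in a small function. The sketch below is one possible variation, not the code from the answer: it compares boolean flags with their shifted copy instead of calling diff twice, which keeps the labels as integers and avoids the NaN fix-up. The function names label_trend_changes and label_full_cycles are hypothetical. The second function is an assumption about what the question's expected output implies, namely that a cycle covers a rise plus the following fall, so a new label starts only when a fall turns back into a rise.

```python
import pandas as pd


def label_trend_changes(values: pd.Series) -> pd.Series:
    """Start a new label at every reversal (rise -> fall or fall -> rise)."""
    falling = values.diff() < 0                   # the first row's NaN compares as False
    previous = falling.shift(fill_value=False)    # direction flag of the row before
    changed = falling != previous                 # True wherever the direction flips
    return changed.cumsum() + 1


def label_full_cycles(values: pd.Series) -> pd.Series:
    """Start a new label only when a fall turns back into a rise."""
    falling = values.diff() < 0
    previous = falling.shift(fill_value=False)
    restart = ~falling & previous                 # not falling now, but falling one row earlier
    return restart.cumsum() + 1


df = pd.DataFrame(
    {"extension": [0.000, 0.050, 0.100, 0.150, 0.130, 0.080, 0.020, 0.050, 0.075]}
)
df["label"] = label_trend_changes(df["extension"])
df["cycle"] = label_full_cycles(df["extension"])
print(df)
```

On the sample data, label reproduces the result above as integers (1, 1, 1, 1, 2, 2, 2, 3, 3), while cycle gives 1 for the first seven rows and 2 for the last two, which is the labeling shown in the question.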

EDIT: answer to edited question in the comments
