Identifying increasing features in a data frame

Question

I have a data frame in which some features hold cumulative values. I need to identify those features in order to revert the cumulative values. This is how my dataset looks (plus about 50 more variables):

What I wish to achieve is:

I've seen this answer, but it first reverts the values and then tries to identify the columns. Can't I do it the other way around? First identify the features and then revert the values?

Finding cumulative features in dataframe?

What I do at the moment is run the following code, which gives me the names of the features with cumulative values:

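def accmulate_col(value):
    # Flag a column with 1 if it never decreases and rises at least once.
    count = 0        # number of non-decreasing steps
    count_1 = False  # becomes True once a strict increase is seen
    name = []
    for i in range(len(value) - 1):
        if value[i+1] - value[i] >= 0:
            count += 1
        if value[i+1] - value[i] > 0:
            count_1 = True
    name.append(1 if count == len(value) - 1 and count_1 else 0)
    return name

df.apply(accmulate_col)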

Afterwards, I save these feature names manually in a list called cum_features and revert the values, creating the desired dataset:

import numpy as np

df_clean = df.copy()
# Write back to the same columns that were selected (cum_features)
df_clean[cum_features] = df_clean[cum_features].apply(lambda col: np.diff(col, prepend=0))

Is there a better way to solve my problem?

Answer 1

To identify which columns have increasing* values throughout the whole column, you need to apply a condition to all of the values. So in that sense, you have to use the values first to figure out which columns fit the condition.

With that out of the way, given a dataframe such as:

import pandas as pd
d = {'a': [1,2,3,4],
     'b': [4,3,2,1]
     }
df = pd.DataFrame(d)
#Output:
   a  b
0  1  4
1  2  3
2  3  2
3  4  1

Figuring out which columns contain increasing values is just a matter of using diff on all values in the dataframe and checking which columns are increasing throughout.

That can be written as:

out = (df.diff().dropna()>0).all()
#Output:
a     True
b    False
dtype: bool
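
Note that the check above requires strictly increasing values, whereas the loop in the question accepts flat stretches as long as there is at least one strict increase. A minimal sketch of that looser condition follows, assuming numeric columns without missing values. The example data and the names `steps` and `is_cumulative` are illustrative and not part of the original post.

```python
import pandas as pd

# Hypothetical example data: 'a' rises with one flat step, 'b' falls, 'c' is constant.
df = pd.DataFrame({
    "a": [1, 2, 2, 4],
    "b": [4, 3, 2, 1],
    "c": [5, 5, 5, 5],
})

# Row-to-row differences; the first row is NaN by construction, so skip it.
steps = df.diff().iloc[1:]

# Non-decreasing everywhere AND at least one strict increase,
# mirroring the two conditions in the question's loop.
is_cumulative = (steps >= 0).all() & (steps > 0).any()

# Column names that satisfy the condition.
cum_features = is_cumulative[is_cumulative].index.tolist()
print(cum_features)
```

Whether flat stretches should count is a modelling choice: a running total that sometimes receives a zero increment stays flat for a row, so the strict `> 0` test would reject it.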

Then, you can just use the column names to select only those with True in them:

new_df = df[df.columns[out]]
#Output:
   a
0  1
1  2
2  3
3  4
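
To go from identification to the reverted dataset the question asks for, the boolean result can drive the differencing step directly, so no list has to be saved by hand. This is only a sketch under the same assumptions as the example above. `cum_features` and `df_clean` reuse the question's names, and filling the first row with the original value is one possible choice, made here to match `np.diff(col, prepend=0)`.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3, 4], "b": [4, 3, 2, 1]})

# Step 1: identify the columns whose values increase on every row.
out = (df.diff().dropna() > 0).all()
cum_features = out[out].index.tolist()

# Step 2: revert only those columns to per-row increments.
# diff() leaves NaN in the first row; fill it with the original first value.
df_clean = df.copy()
df_clean[cum_features] = df[cum_features].diff().fillna(df[cum_features])
print(df_clean)
```

Columns that were not flagged (here `b`) are left untouched. The differenced columns come back as floats because of the intermediate NaN.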

*(The term "cumulative" doesn't really describe the condition you used. Did you want the values to be cumulative or just increasing? Cumulative implies that the value in a particular row/index is the sum of all previous values up to that index, while increasing just means that the value in the current row/index is greater than the previous one.)
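
The distinction drawn in this footnote can be illustrated with a small round trip: a cumulative column is a running sum, and differencing it gives back the per-row amounts. The series below are made-up values for illustration only.

```python
import pandas as pd

# Hypothetical per-row amounts.
increments = pd.Series([3, 0, 2, 5])

# Cumulative: each entry is the sum of all increments up to that index.
cumulative = increments.cumsum()

# Differencing the running sum recovers the increments; the first row has
# no predecessor, so it is filled with the first cumulative value.
recovered = cumulative.diff().fillna(cumulative.iloc[0]).astype(int)

print(cumulative.tolist())
print(recovered.tolist())
```

An increasing column, by contrast, only guarantees that each value is greater than the previous one. The values alone do not say whether it was built as a running sum.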

Solution 1
0 · Accepted · 2019-08-06 13:13:07
