
Identify increasing features in a data frame

I have a data frame in which some features hold cumulative values. I need to identify those features so that I can revert the cumulative values. This is how my dataset looks (plus about 50 more variables):

a      b     
346    17    
76     52    
459    70    
680    96    
679    167   
246    180   

What I wish to achieve is:

a      b     
346    17    
76     35    
459    18    
680    26    
679    71   
246    13   

I've seen this answer, but it first reverts the values and then tries to identify the columns. Can't I do it the other way around, first identifying the features and then reverting the values?

Finding cumulative features in dataframe?

What I do at the moment is run the following code to find out which features have cumulative values:

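 def accmulate_col(value):
     # Flag a column with 1 if it never decreases and rises at least once, else 0.
     count = 0        # number of non-decreasing steps
     count_1 = False  # becomes True once a strict increase is seen
     name = []
     for i in range(len(value) - 1):
         step = value.iloc[i + 1] - value.iloc[i]  # positional access, independent of the index labels
         if step >= 0:
             count += 1
         if step > 0:
             count_1 = True
     name.append(1 if count == len(value) - 1 and count_1 else 0)
     return name

 df.apply(accmulate_col)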

Afterwards, I manually save these feature names in a list called cum_features and revert the values, creating the desired dataset:

import numpy as np
df_clean = df.copy()
df_clean[cum_features] = df_clean[cum_features].apply(lambda col: np.diff(col, prepend=0))

Is there a better way to solve my problem?

To identify which columns have increasing* values throughout the whole column, you need to apply a condition to all of their values. In that sense, you have to look at the values first to figure out which columns fit the condition.

With that out of the way, given a dataframe such as:

import pandas as pd
d = {'a': [1,2,3,4],
     'b': [4,3,2,1]
     }
df = pd.DataFrame(d)
#Output:
   a  b
0  1  4
1  2  3
2  3  2
3  4  1

Figuring out which columns contain increasing values is just a matter of calling diff on the dataframe and checking which columns increase at every step.

That can be written as:

out = (df.diff().dropna()>0).all()
#Output:
a     True
b    False
dtype: bool
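
The `> 0` comparison accepts only strictly increasing columns. The function in the question uses a slightly looser rule: no step may go down, and at least one step must go up. A minimal sketch of that rule in vectorized form, assuming a made-up frame (`demo` and its column names are illustrative, not from the question):

```python
import pandas as pd

# Illustrative data: 'steps' repeats a value once, 'flat' never changes,
# 'down' decreases at every row.
demo = pd.DataFrame({
    "steps": [1, 1, 3, 4],
    "flat": [2, 2, 2, 2],
    "down": [4, 3, 2, 1],
})

# Row-to-row differences; the first row has no predecessor, so skip it.
steps = demo.diff().iloc[1:]

non_decreasing = (steps >= 0).all()   # no step goes down
rises_somewhere = (steps > 0).any()   # at least one step goes up
mask = non_decreasing & rises_somewhere
```

Here only `steps` is flagged: `flat` never rises and `down` falls.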

Then you can use the column names to select only the columns marked True:

new_df = df[df.columns[out]]
#Output:
   a
0  1
1  2
2  3
3  4
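
Putting the pieces together for the data in the question, one possible sketch identifies the columns first and then reverts only those. It assumes the two columns shown in the question and swaps `np.diff(col, prepend=0)` for a subtraction of the shifted frame, which should give the same values while keeping the integer dtype:

```python
import pandas as pd

# Data from the question (only the two columns shown there).
df = pd.DataFrame({
    "a": [346, 76, 459, 680, 679, 246],
    "b": [17, 52, 70, 96, 167, 180],
})

# Same test as above: every row-to-row step must be positive.
mask = (df.diff().dropna() > 0).all()
cum_features = mask[mask].index.tolist()

# Subtract the previous row from each row. The first row is compared
# with 0 (fill_value), so it keeps its original value.
df_clean = df.copy()
df_clean[cum_features] = df[cum_features] - df[cum_features].shift(fill_value=0)
```

With this approach `cum_features` no longer has to be typed in by hand, and column `a` is left untouched because it does not pass the test.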

*(The term "cumulative" doesn't really represent the conditions you used. Did you want the columns to be cumulative or just increasing? Cumulative implies that the value in a particular row/index is the sum of all previous values up to that index, while increasing means only that the value in the current row/index is greater than the previous one.)
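
To make the distinction concrete, here is a small sketch with hypothetical numbers (none of them come from the question). A running total of non-negative amounts never decreases, which is why an increasing-values test is used as a stand-in for "cumulative", but a column that merely increases passes the same test:

```python
import pandas as pd

# Hypothetical per-row amounts.
per_row = pd.Series([5, 0, 2, 7])

# A cumulative column is the running total of those amounts.
running_total = per_row.cumsum()

# Differencing undoes the running total; the first row has no
# predecessor, so it is compared with 0 via fill_value.
recovered = running_total - running_total.shift(fill_value=0)

# A column that simply increases, without being a sum of anything.
just_increasing = pd.Series([1, 2, 3, 4])

# Both pass a non-decreasing check, so the check alone cannot tell
# a true running total from an ordinary increasing column.
both_pass = (running_total.is_monotonic_increasing
             and just_increasing.is_monotonic_increasing)
```

Whether reverting a flagged column makes sense therefore still depends on knowing what the column represents.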
