Identify increasing features in a data frame

I have a data frame in which some features hold cumulative values. I need to identify those features so that I can revert the cumulative values. This is what my dataset looks like (plus about 50 more variables):

a      b     
346    17    
76     52    
459    70    
680    96    
679    167   
246    180   

What I wish to achieve is:

a      b     
346    17    
76     35    
459    18    
680    26    
679    71   
246    13   

I've seen this answer, but it first reverts the values and then tries to identify the columns. Can't I do it the other way around: first identify the features and then revert the values?

Finding cumulative features in dataframe?

What I do at the moment is run the following code, which gives me the names of the features with cumulative values:

def accmulate_col(value):
    # Count the steps that do not decrease, and note whether any step strictly increases
    count = 0
    count_1 = False
    for i in range(len(value) - 1):
        step = value.iloc[i + 1] - value.iloc[i]
        if step >= 0:
            count += 1
        if step > 0:
            count_1 = True
    # [1] if the column never decreases and increases at least once, else [0]
    return [1] if count == len(value) - 1 and count_1 else [0]

df.apply(accmulate_col)

Afterwards, I manually save these feature names in a list called cum_features and revert the values, creating the desired dataset:

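import numpy as np

df_clean = df.copy()
# Replace each cumulative column with its row-to-row differences (the first row keeps its own value)
df_clean[cum_features] = df_clean[cum_features].apply(lambda col: np.diff(col, prepend=0))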

Is there a better way to solve my problem?

To identify which columns have increasing* values throughout the whole column, you will need to apply conditions on all the values. So in that sense, you have to use the values first to figure out what columns fit the conditions.

With that out of the way, given a dataframe such as:

import pandas as pd
d = {'a': [1,2,3,4],
     'b': [4,3,2,1]
     }
df = pd.DataFrame(d)
#Output:
   a  b
0  1  4
1  2  3
2  3  2
3  4  1

Figuring out which columns contain increasing values is just a question of using diff on all values in the dataframe, and checking which ones are increasing throughout the whole column.

That can be written as:

out = (df.diff().dropna()>0).all()
#Output:
a     True
b    False
dtype: bool

Then you can use the column names to select only the columns marked True:

new_df = df[df.columns[out]]
#Output:
   a
0  1
1  2
2  3
3  4
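
Applied to the sample data from the question, the same check can feed the revert step directly, so no list has to be maintained by hand. This is only a sketch: the names steps, increasing and reverted are introduced here for illustration, and the final cast to int assumes the cumulative columns hold integers with no missing values.

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    'a': [346, 76, 459, 680, 679, 246],
    'b': [17, 52, 70, 96, 167, 180],
})

# Row-to-row differences; the first row is always NaN, so skip it for the check
steps = df.diff().iloc[1:]
increasing = (steps > 0).all()
cum_features = list(df.columns[increasing.to_numpy()])

# Revert only the selected columns
reverted = df[cum_features].diff()
reverted.iloc[0] = df[cum_features].iloc[0]  # the first row keeps its original value
df_clean = df.copy()
df_clean[cum_features] = reverted.astype(int)  # assumes integer data

print(cum_features)
print(df_clean)
```

Column a is left untouched because it decreases at several points, while b is replaced by its successive differences.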

*(The term cumulative doesn't really match the conditions you used. Did you want the values to be cumulative or just increasing? Cumulative implies that the value in a particular row/index is the sum of all previous values up to that index, while increasing means only that the value in the current row/index is greater than the previous one.)
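
For reference, the loop in the question tests a slightly looser condition than a strict increase: the column never decreases and increases at least once. A vectorized sketch of that condition might look like the following; the function name cumulative_like and the example frame are hypothetical, chosen only to show how ties and constant columns are treated.

```python
import pandas as pd

def cumulative_like(df):
    """Return the names of columns that never decrease and increase at least once."""
    steps = df.diff().iloc[1:]            # skip the NaN first row
    never_decreases = (steps >= 0).all()  # ties are allowed
    increases_once = (steps > 0).any()    # rules out constant columns
    mask = (never_decreases & increases_once).to_numpy()
    return list(df.columns[mask])

# Hypothetical example data
example = pd.DataFrame({
    'strict': [1, 2, 3, 4],
    'with_ties': [1, 1, 2, 2],
    'constant': [5, 5, 5, 5],
    'falling': [4, 3, 2, 1],
})
print(cumulative_like(example))
```

A running total over non-negative amounts can repeat a value whenever an increment is zero, so allowing ties may be the safer test if that is how the columns were built.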
