I have a data frame in which some features hold cumulative values. I need to identify those features in order to revert the cumulative values. This is how my dataset looks (plus about 50 more variables):
a b
346 17
76 52
459 70
680 96
679 167
246 180
What I wish to achieve is:
a b
346 17
76 35
459 18
680 26
679 71
246 13
I've seen this answer, but it first reverts the values and then tries to identify the columns. Can't I do it the other way around: first identify the features and then revert the values?
What I do at the moment is run the following code, which flags the features that hold cumulative values:
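def accmulate_col(value):
    count = 0        # number of consecutive pairs that do not decrease
    count_1 = False  # becomes True once at least one strict increase is seen
    name = []
    for i in range(len(value)-1):
        if value[i+1]-value[i] >= 0:
            count += 1
        if value[i+1]-value[i] > 0:
            count_1 = True
    # flag the column with 1 only if it never decreases and increases at least once
    name.append(1) if count == len(value)-1 and count_1 else name.append(0)
    return name

df.apply(accmulate_col)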
Afterwards, I save these feature names manually in a list called cum_features and revert the values, creating the desired dataset:
import numpy as np
df_clean = df.copy()
df_clean[cum_features] = df_clean[cum_features].apply(lambda col: np.diff(col, prepend=0))
Is there a better way to solve my problem?
To identify which columns have increasing* values throughout the whole column, you will need to apply conditions on all the values. So in that sense, you have to use the values first to figure out what columns fit the conditions.
With that out of the way, given a dataframe such as:
import pandas as pd
d = {'a': [1, 2, 3, 4],
     'b': [4, 3, 2, 1]}
df = pd.DataFrame(d)
#Output:
   a  b
0  1  4
1  2  3
2  3  2
3  4  1
Figuring out which columns contain increasing values is just a question of using diff on all values in the dataframe, and checking which ones are increasing throughout the whole column.
That can be written as:
out = (df.diff().dropna()>0).all()
#Output:
a     True
b    False
dtype: bool
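Note that the check above is strict (`> 0`), while the loop in the question also accepts repeated values as long as there is at least one real increase. A minimal sketch of both variants follows, assuming the columns are numeric and contain no missing values. The columns `c` and `d` are hypothetical and added only to show the difference.

```python
import pandas as pd

# 'c' repeats a value (non-decreasing), 'd' is constant; both are made up for illustration
df = pd.DataFrame({
    "a": [1, 2, 3, 4],
    "b": [4, 3, 2, 1],
    "c": [1, 1, 2, 3],
    "d": [5, 5, 5, 5],
})

# Row-to-row differences; the first row is NaN by construction, so skip it
steps = df.diff().iloc[1:]

# Strict version: every step must go up
strict = (steps > 0).all()

# Relaxed version, mirroring the question's loop:
# no step goes down, and at least one step goes up
relaxed = (steps >= 0).all() & (steps > 0).any()

print(strict)
print(relaxed)
```

With this data, `strict` is True only for `a`, while `relaxed` is True for `a` and `c`. The constant column `d` is rejected by both.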
Then, you can just use the column names to select only those with True in them:
new_df = df[df.columns[out]]
#Output:
   a
0  1
1  2
2  3
3  4
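Putting the pieces together on the sample data from the question, the whole thing could look roughly like the sketch below. It assumes the strict "always increasing" test is what you want, and it reuses the question's `cum_features` and `df_clean` names. Taking the difference on the underlying NumPy array is just one way to do the revert step.

```python
import numpy as np
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "a": [346, 76, 459, 680, 679, 246],
    "b": [17, 52, 70, 96, 167, 180],
})

# Step 1: identify the columns whose values increase at every row
increasing = (df.diff().iloc[1:] > 0).all()
cum_features = increasing[increasing].index.tolist()

# Step 2: revert only those columns; prepend=0 keeps the first value unchanged
df_clean = df.copy()
df_clean[cum_features] = np.diff(df[cum_features].to_numpy(), axis=0, prepend=0)

print(cum_features)
print(df_clean)
```

Here only `b` is selected, and after the revert it holds 17, 35, 18, 26, 71, 13, which matches the desired output in the question. Column `a` is left untouched.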
*(The term cumulative doesn't really match the conditions you used. Did you want the columns to be cumulative or just increasing? Cumulative implies that the value in a particular row/index is the sum of all previous values up to that index, while increasing just means that the value in the current row/index is greater than the previous one.)
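To make the distinction concrete, here is a small sketch using NumPy only. The `increments` array is column `b` of the desired output in the question. The `signed` and `ids` arrays are hypothetical examples.

```python
import numpy as np

# A cumulative column is a running total of some underlying increments
increments = np.array([17, 35, 18, 26, 71, 13])
cumulative = np.cumsum(increments)

# diff with prepend=0 undoes the running total exactly
recovered = np.diff(cumulative, prepend=0)

# A running total is only increasing when every increment is positive
signed = np.array([5, -2, 4])        # hypothetical increments with a negative step
signed_total = np.cumsum(signed)     # cumulative, but not increasing

# An increasing column need not be a running total at all (e.g. hypothetical IDs)
ids = np.array([1001, 1002, 1005, 1010])
ids_increasing = bool((np.diff(ids) > 0).all())

print(cumulative, recovered, signed_total, ids_increasing)
```

So an "is it increasing?" test finds cumulative columns only when the underlying increments are all positive, and it will also pick up columns like `ids` that are increasing for unrelated reasons.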