I have a dataframe with a few thousand rows and multiple columns.
I want to reduce the size of this dataframe by removing rows whose values in columns A, B and C are too similar to those of another row and whose value in column D is equal. In other words, rows where the difference between the values in each of these columns is below a threshold. This threshold can be different for each column. Of the similar rows, I want to keep the one with the highest value in column E.
I have code that populates a new dataframe and checks whether each row of the old dataframe is too similar to anything already present in the new dataframe.
cols = [list-of-column-names]
df = pd.DataFrame(l, columns=cols) # l is a list of thousands of lists with values to populate the dataframe
df.sort_values(by='E', ascending=False, inplace=True) # Sort based on the column I want to keep the highest value
new_df = pd.DataFrame(columns=cols) # Create new dataframe
for i, line in df.iterrows(): # Iterate over old dataframe
    if len(
        new_df[
            (THRESHOLD_A1 < abs(1e6 * (new_df['A'] - line['A']) / new_df['A'])) & (
                abs(1e6 * (new_df['A'] - line['A']) / new_df['A']) < THRESHOLD_A2) &
            (new_df['E'] == line['E']) &
            (abs(new_df['C'] - line['C']) < THRESHOLD_C) &
            ((abs(new_df['D'] - line['D']) / new_df['D']) < THRESHOLD_D)
        ]
    ) == 0: # If no row in the new dataframe was found, then append this row to new dataframe
        new_df = pd.concat([new_df, pd.DataFrame([line])])
However, this code is too slow. Is there a better way to write this?
Example:
d = {
'A': [1, 1.5, 1.4, 7, 8],
'B': [10, 11, 11.5, 13, 14],
'C': [50, 50.5, 50.6, 60, 70],
'D': [5, 4, 5, 3, 2],
'E': [100, 101, 102, 103, 104]
}
df = pd.DataFrame(d)
THRESHOLD_A = 1
THRESHOLD_B = 2
THRESHOLD_C = 1
'''
# Values are too similar if the absolute difference between values of the same column is below the threshold
# Values in column D need to be the same
# If two rows are too similar, preserve the one with the highest value in column E
This would remove row 0
Rationale:
row 0 ['D'] == row 2 ['D']
abs(row 0 ['A'] - row 2 ['A']) == 0.4 <= THRESHOLD_A
abs(row 0 ['B'] - row 2 ['B']) == 1.5 <= THRESHOLD_B
abs(row 0 ['C'] - row 2 ['C']) == 0.6 <= THRESHOLD_C
row 2 has the highest value in column 'E' == 102.
'''
     A     B     C  D    E
0  1.0  10.0  50.0  5  100
1  1.5  11.0  50.5  4  101
2  1.4  11.5  50.6  5  102
3  7.0  13.0  60.0  3  103
4  8.0  14.0  70.0  2  104
Output:
     A     B     C  D    E
0  1.5  11.0  50.5  4  101
1  1.4  11.5  50.6  5  102
2  7.0  13.0  60.0  3  103
3  8.0  14.0  70.0  2  104
One approach is to round the numbers to a specific number of decimal places and then apply a groupby on the result. The drawback is that the thresholds cannot be set arbitrarily, since rounding only supports steps that are powers of ten. Demonstration:
import pandas as pd
df = pd.DataFrame({'A':[1.514, 1.54, 4.86], 'B': [1.51, 3.58, 4.01], 'C': [1.21, 8.52,4.21], 'E': [5,10,20]})
es = df.groupby(df['A'].round(1)).apply(lambda x: x['E'].max())
df[df['E'].isin(es)]
Output:
      A     B     C   E
1  1.54  3.58  8.52  10
2  4.86  4.01  4.21  20
Here, since the A values of the first two rows are similar (both round to 1.5), we keep only the one with the highest value in column E. A better option is to use pd.cut:
t1 = int((df['A'].max()-df['A'].min())/THRESHOLD_A)
pd.cut(df['A'], bins = t1)
t2 = int((df['B'].max()-df['B'].min())/THRESHOLD_B)
pd.cut(df['B'], bins = t2)
This assigns each value to a bin, which serves as its group. For column A of your new sample data:
0    (0.993, 2.0]
1    (0.993, 2.0]
2    (0.993, 2.0]
3      (6.0, 7.0]
4      (7.0, 8.0]
Name: A, dtype: category
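One caveat worth checking before relying on bins: pd.cut uses fixed bin edges, so two values that are closer than the threshold can still land in different bins when they sit on opposite sides of an edge. The following is a minimal sketch of that effect. The series values and the name THRESHOLD are made up for illustration and are not taken from the question's data.

```python
import pandas as pd

THRESHOLD = 1  # hypothetical threshold, for illustration only
s = pd.Series([1.0, 1.9, 2.1, 8.0])  # made-up values

# Bin count derived from the threshold: value range divided by the threshold.
# max(1, ...) guards against a range smaller than the threshold (0 bins is invalid).
n_bins = max(1, int((s.max() - s.min()) / THRESHOLD))

# labels=False returns the integer bin index instead of an interval label.
bin_index = pd.cut(s, bins=n_bins, labels=False)

# 1.9 and 2.1 differ by only 0.2, yet the bin edge at 2.0 separates them.
same_bin = bin_index.iloc[1] == bin_index.iloc[2]
print(bin_index.tolist(), same_bin)
```

If this kind of miss matters for your data, binning should be treated as an approximation of the threshold rule rather than an exact implementation of it.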
Demonstration on your sample:
THRESHOLD_A = 1
THRESHOLD_B = 2
THRESHOLD_C = 1
d = {'A': [1, 1.5, 1.4, 7, 8],
'B': [10, 11, 11.5, 13, 14],
'C': [50, 50.5, 50.6, 60, 70],
'D': [5, 4, 5, 3, 2],
'E': [100, 101, 102, 103, 104]
}
df = pd.DataFrame(d)
cs = df.columns
t1 = int((df['A'].max()-df['A'].min())/THRESHOLD_A)
c1 = pd.cut(df['A'], bins = t1)
t2 = int((df['B'].max()-df['B'].min())/THRESHOLD_B)
c2 = pd.cut(df['B'], bins = t2)
t3 = int((df['C'].max()-df['C'].min())/THRESHOLD_C)
c3 = pd.cut(df['C'], bins = t3)
df['c1'] = c1
df['c2'] = c2
df['c3'] = c3
t = df.groupby(['c1', 'c2', 'c3', 'D'])['E'].apply(lambda x: x.max()).reset_index()['E']
es = t[t.notna()]
df[df['E'].isin(es)][cs]
Output:
     A     B     C  D    E
1  1.5  11.0  50.5  4  101
2  1.4  11.5  50.6  5  102
3  7.0  13.0  60.0  3  103
4  8.0  14.0  70.0  2  104
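Binning groups values by fixed intervals, so it only approximates the "difference below a threshold" rule. If the exact rule from the question is needed, one alternative is to keep the greedy loop but compare each row against NumPy arrays instead of growing a dataframe with pd.concat. This is a rough sketch, not a tuned implementation: the function name drop_similar and its parameters are hypothetical, and it uses the example's absolute-difference rule (<=) rather than the relative differences in the original loop.

```python
import numpy as np
import pandas as pd

def drop_similar(df, thresholds, equal_col, rank_col):
    """Greedy filter (hypothetical helper): visit rows from highest to lowest
    rank_col and keep a row only if no already-kept row is within every
    threshold and has the same equal_col value."""
    ordered = df.sort_values(rank_col, ascending=False)
    cols = list(thresholds)
    values = ordered[cols].to_numpy(dtype=float)
    limits = np.array([thresholds[c] for c in cols], dtype=float)
    equal = ordered[equal_col].to_numpy()

    kept = []  # positions within `ordered` of the rows kept so far
    for i in range(len(ordered)):
        if kept:
            # Compare row i against all kept rows at once.
            close = (np.abs(values[kept] - values[i]) <= limits).all(axis=1)
            same = equal[kept] == equal[i]
            if (close & same).any():
                continue  # too similar to a kept row with a higher rank_col
        kept.append(i)
    return ordered.iloc[kept].sort_index()

df = pd.DataFrame({
    'A': [1, 1.5, 1.4, 7, 8],
    'B': [10, 11, 11.5, 13, 14],
    'C': [50, 50.5, 50.6, 60, 70],
    'D': [5, 4, 5, 3, 2],
    'E': [100, 101, 102, 103, 104],
})
result = drop_similar(df, {'A': 1, 'B': 2, 'C': 1}, equal_col='D', rank_col='E')
print(result)
```

On the sample data this drops row 0 and keeps rows 1 to 4, the same rows as the binned version. The loop is still quadratic in the worst case, but each step is a vectorized array comparison, which should be manageable for a few thousand rows.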