
Remove rows from pandas dataframe based on multiple columns with similar values

I have a dataframe with a few thousand rows and multiple columns.

I want to reduce the size of this dataframe by removing rows whose values in columns A, B and C are too similar to those of a row already kept, i.e. where the absolute difference in each of those columns is below a per-column threshold, while the value in column D is equal. Of each group of similar rows, I want to keep the one with the highest value in column E.

I have code that populates a new dataframe and checks whether each row of the old dataframe is too similar to anything already present in the new dataframe.

cols = [...]  # list of column names
df = pd.DataFrame(l, columns=cols)  # l is a list of thousands of lists with values to populate the dataframe
df.sort_values(by='E', ascending=False, inplace=True)  # sort so rows with the highest 'E' come first

new_df = pd.DataFrame(columns=cols)  # create new dataframe
for i, line in df.iterrows():  # iterate over old dataframe
    # relative difference of column A, in parts per million
    rel_a = abs(1e6 * (new_df['A'] - line['A']) / new_df['A'])
    similar = new_df[
        (THRESHOLD_A1 < rel_a) & (rel_a < THRESHOLD_A2) &
        (new_df['E'] == line['E']) &
        (abs(new_df['C'] - line['C']) < THRESHOLD_C) &
        ((abs(new_df['D'] - line['D']) / new_df['D']) < THRESHOLD_D)
    ]
    if len(similar) == 0:  # no similar row in the new dataframe, so keep this one
        new_df = pd.concat([new_df, pd.DataFrame([line])])

However, this code is too slow. Is there a better way to write this?


Example:

d = {
  'A': [1, 1.5, 1.4, 7, 8],
  'B': [10, 11, 11.5, 13, 14],
  'C': [50, 50.5, 50.6, 60, 70],
  'D': [5, 4, 5, 3, 2],
  'E': [100, 101, 102, 103, 104]
}
df = pd.DataFrame(d)
    
THRESHOLD_A = 1
THRESHOLD_B = 2
THRESHOLD_C = 1
'''
# Values are too similar if the absolute difference between values of the same column is below the threshold
# Values in column D need to be the same
# If two rows are too similar, preserve the one with the highest value in column E

This would remove row 0

Rationale:
row 0 ['D'] == row 2 ['D']
abs(row 0 ['A'] - row 2 ['A']) == 0.4 <= THRESHOLD_A
abs(row 0 ['B'] - row 2 ['B']) == 1.5 <= THRESHOLD_B
abs(row 0 ['C'] - row 2 ['C']) == 0.6 <= THRESHOLD_C

row 2 has the highest value in column 'E' == 102.
'''

     A     B     C  D    E
0  1.0  10.0  50.0  5  100
1  1.5  11.0  50.5  4  101
2  1.4  11.5  50.6  5  102
3  7.0  13.0  60.0  3  103
4  8.0  14.0  70.0  2  104

Output:

     A     B     C  D    E
0  1.5  11.0  50.5  4  101
1  1.4  11.5  50.6  5  102
2  7.0  13.0  60.0  3  103
3  8.0  14.0  70.0  2  104

One approach is to round the values to a given number of decimal places and then group by the result. The problem is that the thresholds then cannot be set in an arbitrary way. Demonstration:

df = pd.DataFrame({'A': [1.514, 1.54, 4.86], 'B': [1.51, 3.58, 4.01], 'C': [1.21, 8.52, 4.21], 'E': [5, 10, 20]})
es = df.groupby(df['A'].round(1))['E'].max()  # max 'E' within each group of rounded 'A' values
df[df['E'].isin(es)]

Output:

      A     B     C   E
1  1.54  3.58  8.52  10
2  4.86  4.01  4.21  20

Here, because the rounded A values of the first two rows are identical, we keep the one with the maximum value in column E.
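
As a side note, filtering back with df['E'].isin(es) only works because the E values happen to be unique; if the same E value could occur in different groups, a variant built on the standard idxmax method (which returns, per group, the index label of the maximum) avoids that assumption:

# keep, for each rounded-A group, the row whose 'E' is largest
idx = df.groupby(df['A'].round(1))['E'].idxmax()
df.loc[idx]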

A better option is to use pd.cut:

t1 = int((df['A'].max() - df['A'].min()) / THRESHOLD_A)  # number of bins of width THRESHOLD_A
pd.cut(df['A'], bins=t1)
t2 = int((df['B'].max() - df['B'].min()) / THRESHOLD_B)  # number of bins of width THRESHOLD_B
pd.cut(df['B'], bins=t2)

which gives you the groups. Note that binning only approximates the threshold rule: two values that differ by less than the threshold can still straddle a bin edge and land in different groups. On your sample data, pd.cut(df['A'], bins=t1) yields:

0    (0.993, 2.0]
1    (0.993, 2.0]
2    (0.993, 2.0]
3      (6.0, 7.0]
4      (7.0, 8.0]
Name: A, dtype: category

Demonstration on your sample:

THRESHOLD_A = 1
THRESHOLD_B = 2
THRESHOLD_C = 1

d = {'A': [1, 1.5, 1.4, 7, 8],
     'B': [10, 11, 11.5, 13, 14],
     'C': [50, 50.5, 50.6, 60, 70],
     'D': [5, 4, 5, 3, 2],
     'E': [100, 101, 102, 103, 104]
}

df = pd.DataFrame(d)
cs = df.columns  # remember the original columns so the helper bin columns can be dropped at the end

t1 = int((df['A'].max() - df['A'].min()) / THRESHOLD_A)
c1 = pd.cut(df['A'], bins=t1)
t2 = int((df['B'].max() - df['B'].min()) / THRESHOLD_B)
c2 = pd.cut(df['B'], bins=t2)
t3 = int((df['C'].max() - df['C'].min()) / THRESHOLD_C)
c3 = pd.cut(df['C'], bins=t3)

df['c1'] = c1
df['c2'] = c2
df['c3'] = c3
# group by the three bin labels plus exact equality on D and take the max 'E' per group;
# unobserved bin combinations yield NaN and are dropped below
t = df.groupby(['c1', 'c2', 'c3', 'D'])['E'].max().reset_index()['E']
es = t[t.notna()]

df[df['E'].isin(es)][cs]

Output:

     A     B     C  D    E
1  1.5  11.0  50.5  4  101
2  1.4  11.5  50.6  5  102
3  7.0  13.0  60.0  3  103
4  8.0  14.0  70.0  2  104
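
For reference, here is a minimal sketch that packages the same idea into one reusable function. The name dedupe_similar and the thresholds dict are hypothetical, not part of the original code, and groupby is called with observed=True so the unobserved bin combinations never appear:

import pandas as pd

def dedupe_similar(df, thresholds, equal_col='D', keep_max_col='E'):
    # hypothetical helper: bin each thresholded column with pd.cut so that
    # every bin is roughly one threshold wide, group on the bins plus the
    # exact-match column, and keep the row with the largest keep_max_col
    keys = []
    for col, th in thresholds.items():
        n_bins = max(1, int((df[col].max() - df[col].min()) / th))
        keys.append(pd.cut(df[col], bins=n_bins))
    keys.append(df[equal_col])
    idx = df.groupby(keys, observed=True)[keep_max_col].idxmax()
    return df.loc[sorted(idx)]

df = pd.DataFrame({'A': [1, 1.5, 1.4, 7, 8],
                   'B': [10, 11, 11.5, 13, 14],
                   'C': [50, 50.5, 50.6, 60, 70],
                   'D': [5, 4, 5, 3, 2],
                   'E': [100, 101, 102, 103, 104]})
print(dedupe_similar(df, {'A': 1, 'B': 2, 'C': 1}))

On the sample above this returns the same four rows. Keep in mind the caveat mentioned earlier: binning is an approximation, so two values within a threshold of each other can still fall into adjacent bins.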
