Remove rows from pandas dataframe based on multiple columns with similar values

I have a dataframe with a few thousand rows and multiple columns.

I want to reduce the size of this dataframe by removing rows whose values in columns A, B and C are too similar to those of another row and whose value in column D is equal. In other words, rows where the absolute difference in each of those columns is below a threshold. This threshold can be different for each column. Of the rows that are too similar, I want to keep the one with the highest value in column E.

I have code that populates a new dataframe and checks whether each row of the old dataframe is too similar to anything already present in the new dataframe.

cols = [list-of-column-names]
df = pd.DataFrame(l, columns=cols) # l is a list of thousands of lists with values to populate the dataframe
df.sort_values(by='E', ascending=False, inplace=True) # Sort based on the column I want to keep the highest value

new_df = pd.DataFrame(columns=cols) # Create new dataframe
for i, line in df.iterrows(): # Iterate over old dataframe
    if len(
            new_df[
                (THRESHOLD_A1 < abs(1e6 * (new_df['A'] - line['A']) / new_df['A'])) & (
                        abs(1e6 * (new_df['A'] - line['A']) / new_df['A']) < THRESHOLD_A2) &
                (new_df['E'] == line['E']) &
                (abs(new_df['C'] - line['C']) < THRESHOLD_C) &
                ((abs(new_df['D'] - line['D']) / new_df['D']) < THRESHOLD_D)
            ]
    ) == 0: # If no row in the new dataframe was found, then append this row to new dataframe
        new_df = pd.concat([new_df, pd.DataFrame([line])])

However, this code is too slow. Is there a better way to write this?


Example:

d = {
  'A': [1, 1.5, 1.4, 7, 8],
  'B': [10, 11, 11.5, 13, 14],
  'C': [50, 50.5, 50.6, 60, 70],
  'D': [5, 4, 5, 3, 2],
  'E': [100, 101, 102, 103, 104]
}
df = pd.DataFrame(d)
    
THRESHOLD_A = 1
THRESHOLD_B = 2
THRESHOLD_C = 1
'''
# Values are too similar if the absolute difference between values of the same column is at or below the threshold
# Values in column D need to be the same
# If two rows are too similar, preserve the one with the highest value in column E

This would remove row 0

Rationale
row 0 ['D'] == row 2 ['D']
abs(row 0 ['A'] - row 2 ['A']) == 0.4 <= THRESHOLD_A
abs(row 0 ['B'] - row 2 ['B']) == 1.5 <= THRESHOLD_B
abs(row 0 ['C'] - row 2 ['C']) == 0.6 <= THRESHOLD_C
    
row 2 has the highest value in column 'E' == 102.
'''

     A     B     C  D    E
0  1.0  10.0  50.0  5  100
1  1.5  11.0  50.5  4  101
2  1.4  11.5  50.6  5  102
3  7.0  13.0  60.0  3  103
4  8.0  14.0  70.0  2  104

Output:

     A     B     C  D    E
0  1.5  11.0  50.5  4  101
1  1.4  11.5  50.6  5  102
2  7.0  13.0  60.0  3  103
3  8.0  14.0  70.0  2  104

One approach is to round the numbers to a fixed number of decimal places and then group by the result. The drawback is that the thresholds cannot be set arbitrarily, since rounding only groups values in steps of a power of ten. Demonstration:

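import pandas as pd
df = pd.DataFrame({'A':[1.514, 1.54, 4.86], 'B': [1.51, 3.58, 4.01], 'C': [1.21, 8.52,4.21], 'E': [5,10,20]})
# Group rows whose A values round to the same one-decimal value and take the max E per group
es = df.groupby(df['A'].round(1))['E'].max()
# Keep only the rows whose E is one of the group maxima
df[df['E'].isin(es)]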

Output:

      A     B     C   E
1  1.54  3.58  8.52  10
2  4.86  4.01  4.21  20

Here, because the A values of the first two rows are similar, we keep the one with the highest value in column E. A better option is to use pd.cut:

# Number of bins = value range divided by the threshold (note the parentheses around the range)
t1 = int((df['A'].max()-df['A'].min())/THRESHOLD_A)
pd.cut(df['A'], bins = t1)
t2 = int((df['B'].max()-df['B'].min())/THRESHOLD_B)
pd.cut(df['B'], bins = t2)

which gives you the groups. Based on your new sample data, the bins for column A are:

0    (0.993, 2.0]
1    (0.993, 2.0]
2    (0.993, 2.0]
3      (6.0, 7.0]
4      (7.0, 8.0]
Name: A, dtype: category
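
A note on bin width: passing an integer to `bins` splits the range into equal-width bins, so the width only equals the threshold when the range is an exact multiple of it. As a rough sketch, assuming you want bins that are exactly one threshold wide, you could pass explicit edges instead. The helper name `threshold_bins` is made up for this illustration and is not part of pandas:

```python
import numpy as np
import pandas as pd

def threshold_bins(s, threshold):
    """Cut a numeric Series into bins exactly `threshold` wide (hypothetical helper)."""
    lo = s.min()
    # floor(...) + 1 bins puts the last edge strictly above the maximum
    k = int(np.floor((s.max() - lo) / threshold)) + 1
    edges = lo + threshold * np.arange(k + 1)
    # include_lowest=True so the minimum itself lands in the first bin
    return pd.cut(s, bins=edges, include_lowest=True)

THRESHOLD_A = 1
df = pd.DataFrame({'A': [1, 1.5, 1.4, 7, 8]})
binned = threshold_bins(df['A'], THRESHOLD_A)
codes = binned.cat.codes.tolist()  # integer bin index per row
print(codes)
```

Rows that share a bin index are candidates for the same group, exactly as with the integer `bins` call above.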

Demonstration on your sample:

import pandas as pd

THRESHOLD_A = 1
THRESHOLD_B = 2
THRESHOLD_C = 1

d = {'A': [1, 1.5, 1.4, 7, 8],
     'B': [10, 11, 11.5, 13, 14],
     'C': [50, 50.5, 50.6, 60, 70],
     'D': [5, 4, 5, 3, 2],
     'E': [100, 101, 102, 103, 104]
}

df = pd.DataFrame(d)
cs = df.columns  # original columns, saved before the helper bin columns are added

# Number of bins per column = value range divided by that column's threshold
t1 = int((df['A'].max()-df['A'].min())/THRESHOLD_A)
c1 = pd.cut(df['A'], bins = t1)
t2 = int((df['B'].max()-df['B'].min())/THRESHOLD_B)
c2 = pd.cut(df['B'], bins = t2)
t3 = int((df['C'].max()-df['C'].min())/THRESHOLD_C)
c3 = pd.cut(df['C'], bins = t3)

df['c1'] = c1
df['c2'] = c2
df['c3'] = c3
# observed=True skips bin combinations that contain no rows
t = df.groupby(['c1', 'c2', 'c3', 'D'], observed=True)['E'].max().reset_index()['E']
es = t[t.notna()]

# Keep the rows holding the highest E of each group, restricted to the original columns
df[df['E'].isin(es)][cs]

Output:

     A     B     C  D    E
1  1.5  11.0  50.5  4  101
2  1.4  11.5  50.6  5  102
3  7.0  13.0  60.0  3  103
4  8.0  14.0  70.0  2  104
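
One caveat with binning: two rows that are closer than the threshold can still fall on opposite sides of a bin edge, in which case they are not grouped together. If exact thresholds matter, a possible alternative is to keep the greedy logic from the question but compare against NumPy arrays instead of growing a DataFrame row by row. This is only a sketch under assumptions: the function `drop_similar` and its parameters are hypothetical names, and "too similar" is taken to mean an absolute difference at or below the threshold, as in the example's rationale.

```python
import numpy as np
import pandas as pd

def drop_similar(df, thresholds, equal_cols, rank_col):
    """Greedy filter (hypothetical helper, not part of pandas).

    Rows are visited from highest to lowest `rank_col`. A row is kept only if
    no already-kept row is within every threshold and equal on `equal_cols`.
    """
    ordered = df.sort_values(rank_col, ascending=False)
    near_cols = list(thresholds)
    near = ordered[near_cols].to_numpy(dtype=float)
    limits = np.array([thresholds[c] for c in near_cols], dtype=float)
    same = ordered[list(equal_cols)].to_numpy()
    kept = []  # positions within `ordered` of the rows kept so far
    for i in range(len(ordered)):
        if kept:
            # Compare row i against all kept rows at once
            close = (np.abs(near[kept] - near[i]) <= limits).all(axis=1)
            equal = (same[kept] == same[i]).all(axis=1)
            if (close & equal).any():
                continue  # too similar to a higher-ranked row, so drop it
        kept.append(i)
    return ordered.iloc[kept].sort_index()

d = {'A': [1, 1.5, 1.4, 7, 8],
     'B': [10, 11, 11.5, 13, 14],
     'C': [50, 50.5, 50.6, 60, 70],
     'D': [5, 4, 5, 3, 2],
     'E': [100, 101, 102, 103, 104]}
df = pd.DataFrame(d)
result = drop_similar(df, {'A': 1, 'B': 2, 'C': 1}, equal_cols=['D'], rank_col='E')
print(result)
```

On the sample data this keeps rows 1 to 4 and drops row 0. The loop still runs once per row, but each iteration is a vectorized comparison against the kept rows, so it avoids the repeated `pd.concat` calls that make the original code slow.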
