Removing rows from a pandas DataFrame based on multiple columns with similar values

Question

I have a DataFrame with a few thousand rows and multiple columns.

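I want to reduce the size of this DataFrame by removing rows where the values of columns A, B and C are too similar and column D is equal. In other words, for each of those columns the difference between the values is below a threshold, and the threshold may be different for each column. Of the rows that are too similar, I want to keep the one with the highest value in column E.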

I have code that fills a new DataFrame, checking for each row of the old DataFrame whether it is too similar to anything already in the new one.

cols = [list-of-column-names]
df = pd.DataFrame(l, columns=cols) # l is a list of thousands of lists with values to populate the dataframe
df.sort_values(by='E', ascending=False, inplace=True) # Sort based on the column I want to keep the highest value

new_df = pd.DataFrame(columns=cols) # Create new dataframe
for i, line in df.iterrows(): # Iterate over old dataframe
    if len(
            new_df[
                (THRESHOLD_A1 < abs(1e6 * (new_df['A'] - line['A']) / new_df['A'])) & (
                        abs(1e6 * (new_df['A'] - line['A']) / new_df['A']) < THRESHOLD_A2) &
                (new_df['E'] == line['E']) &
                (abs(new_df['C'] - line['C']) < THRESHOLD_C) &
                ((abs(new_df['D'] - line['D']) / new_df['D']) < THRESHOLD_D)
            ]
    ) == 0: # If no row in the new dataframe was found, then append this row to new dataframe
        new_df = pd.concat([new_df, pd.DataFrame([line])])

However, this code is too slow. Is there a better way to write this?

Example:

d = {
  'A': [1, 1.5, 1.4, 7, 8],
  'B': [10, 11, 11.5, 13, 14],
  'C': [50, 50.5, 50.6, 60, 70],
  'D': [5, 4, 5, 3, 2],
  'E': [100, 101, 102, 103, 104]
}
df = pd.DataFrame(d)
    
THRESHOLD_A = 1
THRESHOLD_B = 2
THRESHOLD_C = 1
'''
# Values are too similar if the absolute difference between values of the same column is below the threshold
# Values in column D need to be the same
# If two rows are too similar, preserve the one with the highest value in column E

This would remove row 0

Rationale
row 0 ['D'] == row 2 ['D']
abs(row 0 ['A'] - row 2 ['A']) == 0.4 <= THRESHOLD_A
abs(row 0 ['B'] - row 2 ['B']) == 1.5 <= THRESHOLD_B
abs(row 0 ['C'] - row 2 ['C']) == 0.6 <= THRESHOLD_C

row 2 has the highest value in column 'E' == 102.
'''

     A     B     C  D    E
0  1.0  10.0  50.0  5  100
1  1.5  11.0  50.5  4  101
2  1.4  11.5  50.6  5  102
3  7.0  13.0  60.0  3  103
4  8.0  14.0  70.0  2  104

Output:

     A     B     C  D    E
0  1.5  11.0  50.5  4  101
1  1.4  11.5  50.6  5  102
2  7.0  13.0  60.0  3  103
3  8.0  14.0  70.0  2  104

Answer 1

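One approach is to round the numbers to a fixed number of decimal places and then apply a group by to the result. The catch is that the thresholds cannot be set arbitrarily, because they are tied to the rounding precision. Demonstration: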

df = pd.DataFrame({'A':[1.514, 1.54, 4.86], 'B': [1.51, 3.58, 4.01], 'C': [1.21, 8.52,4.21], 'E': [5,10,20]})
es = df.groupby(df['A'].round(1)).apply(lambda x: x['E'].max())
df[df['E'].isin(es)]

Output:

      A     B     C   E
1  1.54  3.58  8.52  10
2  4.86  4.01  4.21  20
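
To make the caveat about thresholds concrete, here is a small sketch using made-up values (they are not from the question). Rounding groups values by which side of a rounding boundary they fall on, not by how far apart they are, so two very close values can end up in different groups while two more distant values share one.

```python
import pandas as pd

# Hypothetical values chosen to straddle the 1.45 rounding boundary
s = pd.Series([1.44, 1.46, 1.54])
rounded = s.round(1)

# 1.44 and 1.46 differ by only 0.02 but round to different keys
close_pair_same_group = rounded[0] == rounded[1]

# 1.46 and 1.54 differ by 0.08 yet round to the same key
far_pair_same_group = rounded[1] == rounded[2]

print(rounded.tolist())
print(close_pair_same_group, far_pair_same_group)
```

So a rounding-based group by only approximates "difference below a threshold", and the effective threshold is limited to powers of ten.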

Here, because the A values of two rows are similar, only the one with the maximum value in column E is kept. A better option is to use pd.cut:

# Number of bins = value range of the column divided by its threshold
t1 = int((df['A'].max() - df['A'].min()) / THRESHOLD_A)
pd.cut(df['A'], bins=t1)
t2 = int((df['B'].max() - df['B'].min()) / THRESHOLD_B)
pd.cut(df['B'], bins=t2)

This gives you the groups. For column A of your new sample data:

0    (0.993, 2.0]
1    (0.993, 2.0]
2    (0.993, 2.0]
3      (6.0, 7.0]
4      (7.0, 8.0]
Name: A, dtype: category
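
As a small sketch of what the bin count does, the snippet below asks pd.cut for integer bin codes and the bin edges (via its `labels=False` and `retbins=True` options) for column A of the sample. The variable names `n_bins`, `codes` and `edges` are only illustrative.

```python
import pandas as pd

THRESHOLD_A = 1
a = pd.Series([1, 1.5, 1.4, 7, 8], name='A')

# Range divided by threshold gives the number of equal-width bins
n_bins = int((a.max() - a.min()) / THRESHOLD_A)

# labels=False returns the bin index of each value instead of an interval,
# retbins=True additionally returns the array of bin edges
codes, edges = pd.cut(a, bins=n_bins, labels=False, retbins=True)

print(n_bins)
print(codes.tolist())
print(edges)
```

Each bin is roughly one threshold wide, so rows with the same code differ by at most about that much. Note that if a column's range is smaller than its threshold, the computed bin count is 0 and pd.cut will reject it, so a guard such as `max(1, n_bins)` may be needed.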

Demonstration on your sample:

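import pandas as pd

THRESHOLD_A = 1
THRESHOLD_B = 2
THRESHOLD_C = 1

d = {'A': [1, 1.5, 1.4, 7, 8],
     'B': [10, 11, 11.5, 13, 14],
     'C': [50, 50.5, 50.6, 60, 70],
     'D': [5, 4, 5, 3, 2],
     'E': [100, 101, 102, 103, 104]
}

df = pd.DataFrame(d)
cs = df.columns  # remember the original columns before adding helper columns

# Number of bins per column = value range divided by that column's threshold
t1 = int((df['A'].max() - df['A'].min()) / THRESHOLD_A)
c1 = pd.cut(df['A'], bins=t1)
t2 = int((df['B'].max() - df['B'].min()) / THRESHOLD_B)
c2 = pd.cut(df['B'], bins=t2)
t3 = int((df['C'].max() - df['C'].min()) / THRESHOLD_C)
c3 = pd.cut(df['C'], bins=t3)

df['c1'] = c1
df['c2'] = c2
df['c3'] = c3

# Highest E per (bin of A, bin of B, bin of C, D) group; empty bin combinations may yield NaN
t = df.groupby(['c1', 'c2', 'c3', 'D'])['E'].apply(lambda x: x.max()).reset_index()['E']
es = t[t.notna()]

# Keep only the rows whose E is a group maximum, restricted to the original columns
df[df['E'].isin(es)][cs]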

Output:

     A     B     C  D    E
1  1.5  11.0  50.5  4  101
2  1.4  11.5  50.6  5  102
3  7.0  13.0  60.0  3  103
4  8.0  14.0  70.0  2  104
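
Binning is an approximation: two values closer than the threshold can still fall on opposite sides of a bin edge, and selecting rows with `isin` on E can also match a row from another group that happens to share the same E value. If an exact pairwise check is needed, one possible sketch (not part of the answer above) keeps the greedy logic from the question but runs each comparison on NumPy arrays instead of growing a DataFrame with pd.concat. The function name `drop_similar` and its parameters are hypothetical, and the comparison uses `<=` as in the rationale of the example.

```python
import numpy as np
import pandas as pd

def drop_similar(df, thresholds, equal_col, keep_col):
    """Visit rows from highest to lowest keep_col and keep a row only if no
    already-kept row has the same equal_col value and lies within every
    per-column threshold."""
    ordered = df.sort_values(keep_col, ascending=False)
    cols = list(thresholds)
    values = ordered[cols].to_numpy(dtype=float)
    limits = np.array([thresholds[c] for c in cols], dtype=float)
    keys = ordered[equal_col].to_numpy()

    kept = []  # positions (within `ordered`) of the rows kept so far
    for pos in range(len(ordered)):
        if kept:
            # Compare the current row against all kept rows at once
            close = (np.abs(values[kept] - values[pos]) <= limits).all(axis=1)
            same = keys[kept] == keys[pos]
            if (close & same).any():
                continue  # too similar to a row with a higher keep_col value
        kept.append(pos)
    return ordered.iloc[kept].sort_index()

df = pd.DataFrame({
    'A': [1, 1.5, 1.4, 7, 8],
    'B': [10, 11, 11.5, 13, 14],
    'C': [50, 50.5, 50.6, 60, 70],
    'D': [5, 4, 5, 3, 2],
    'E': [100, 101, 102, 103, 104],
})
result = drop_similar(df, {'A': 1, 'B': 2, 'C': 1}, equal_col='D', keep_col='E')
print(result)
```

On the sample data this drops row 0 and keeps rows 1 to 4. The loop is still quadratic in the worst case, but it avoids rebuilding and re-indexing a DataFrame on every iteration, which is likely where most of the time goes for a few thousand rows.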

Solution 1
0 2022-07-19 04:45:12