Remove rows from pandas dataframe based on multiple columns with similar values
I have a dataframe with a few thousand rows and multiple columns.
I want to reduce the size of this dataframe by removing rows whose values in columns A, B and C are too similar to those of another row while column D is equal. In other words, rows where the difference between the values in each column is below a threshold. This threshold can be different for each column. Also, among similar rows, I want to keep the one with the highest value in column E.
I have code that populates a new dataframe and checks whether each row of the old dataframe is too similar to anything already present in the new dataframe.
cols = [list-of-column-names]
df = pd.DataFrame(l, columns=cols)  # l is a list of thousands of lists with values to populate the dataframe
df.sort_values(by='E', ascending=False, inplace=True)  # Sort by the column whose highest value I want to keep
new_df = pd.DataFrame(columns=cols)  # Create new dataframe
for i, line in df.iterrows():  # Iterate over old dataframe
    if len(
        new_df[
            (THRESHOLD_A1 < abs(1e6 * (new_df['A'] - line['A']) / new_df['A'])) & (
                abs(1e6 * (new_df['A'] - line['A']) / new_df['A']) < THRESHOLD_A2) &
            (new_df['E'] == line['E']) &
            (abs(new_df['C'] - line['C']) < THRESHOLD_C) &
            ((abs(new_df['D'] - line['D']) / new_df['D']) < THRESHOLD_D)
        ]
    ) == 0:  # If no similar row was found in the new dataframe, append this row to it
        new_df = pd.concat([new_df, pd.DataFrame([line])])
However, this code is too slow. Is there a better way to write this?
Example:
d = {
    'A': [1, 1.5, 1.4, 7, 8],
    'B': [10, 11, 11.5, 13, 14],
    'C': [50, 50.5, 50.6, 60, 70],
    'D': [5, 4, 5, 3, 2],
    'E': [100, 101, 102, 103, 104]
}
df = pd.DataFrame(d)
THRESHOLD_A = 1
THRESHOLD_B = 2
THRESHOLD_C = 1
'''
# Values are too similar if the absolute difference between values of the same column is at or below the threshold
# Values in column D need to be the same
# If two rows are too similar, preserve the one with the highest value in column E
This would remove row 0
Rationale:
row 0 ['D'] == row 2 ['D']
abs(row 0 ['A'] - row 2 ['A']) == 0.4 <= THRESHOLD_A
abs(row 0 ['B'] - row 2 ['B']) == 1.5 <= THRESHOLD_B
abs(row 0 ['C'] - row 2 ['C']) == 0.6 <= THRESHOLD_C
row 2 has the highest value in column 'E' == 102.
'''
     A     B     C  D    E
0  1.0  10.0  50.0  5  100
1  1.5  11.0  50.5  4  101
2  1.4  11.5  50.6  5  102
3  7.0  13.0  60.0  3  103
4  8.0  14.0  70.0  2  104
Output:
     A     B     C  D    E
0  1.5  11.0  50.5  4  101
1  1.4  11.5  50.6  5  102
2  7.0  13.0  60.0  3  103
3  8.0  14.0  70.0  2  104
One approach is to round the numbers to a specific number of decimal places and then group by the result. The drawback is that the thresholds cannot be set arbitrarily, because they are tied to the rounding precision. Demonstration:
df = pd.DataFrame({'A':[1.514, 1.54, 4.86], 'B': [1.51, 3.58, 4.01], 'C': [1.21, 8.52,4.21], 'E': [5,10,20]})
es = df.groupby(df['A'].round(1)).apply(lambda x: x['E'].max())
df[df['E'].isin(es)]
Output:
      A     B     C   E
1  1.54  3.58  8.52  10
2  4.86  4.01  4.21  20
Here, because the A values of the first two rows are similar, we keep only the one with the highest value in column E. A better option is to use pd.cut:
t1 = int((df['A'].max()-df['A'].min())/THRESHOLD_A)
pd.cut(df['A'], bins = t1)
t2 = int((df['B'].max()-df['B'].min())/THRESHOLD_B)
pd.cut(df['B'], bins = t2)
which gives you the groups. On your new sample data, the result for column A is:
0 (0.993, 2.0]
1 (0.993, 2.0]
2 (0.993, 2.0]
3 (6.0, 7.0]
4 (7.0, 8.0]
Name: A, dtype: category
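One caveat with fixed bins, sketched here on made-up values (the series and the variable names are illustrative, not taken from the question): two values that are closer than the threshold can still land in different bins when they sit on opposite sides of a bin edge, so binning only approximates the "difference below threshold" rule.

```python
import pandas as pd

THRESHOLD_A = 1

# Hypothetical values chosen so that 1.95 and 2.05 straddle a bin edge.
s = pd.Series([1.0, 1.95, 2.05, 8.0], name='A')

# Same bin-count formula as above: value range divided by the threshold.
n_bins = int((s.max() - s.min()) / THRESHOLD_A)

# labels=False returns the integer bin number instead of the interval.
codes = pd.cut(s, bins=n_bins, labels=False)

# 1.95 and 2.05 differ by far less than THRESHOLD_A ...
print(abs(s[1] - s[2]) < THRESHOLD_A)
# ... yet they are assigned to different bins.
print(codes[1] != codes[2])
```

Whether this matters depends on how strictly the thresholds have to be honored in your data.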
Demonstration on your sample:
THRESHOLD_A = 1
THRESHOLD_B = 2
THRESHOLD_C = 1
d = {'A': [1, 1.5, 1.4, 7, 8],
     'B': [10, 11, 11.5, 13, 14],
     'C': [50, 50.5, 50.6, 60, 70],
     'D': [5, 4, 5, 3, 2],
     'E': [100, 101, 102, 103, 104]
     }
df = pd.DataFrame(d)
cs = df.columns
t1 = int((df['A'].max()-df['A'].min())/THRESHOLD_A)
c1 = pd.cut(df['A'], bins = t1)
t2 = int((df['B'].max()-df['B'].min())/THRESHOLD_B)
c2 = pd.cut(df['B'], bins = t2)
t3 = int((df['C'].max()-df['C'].min())/THRESHOLD_C)
c3 = pd.cut(df['C'], bins = t3)
df['c1'] = c1
df['c2'] = c2
df['c3'] = c3
t = df.groupby(['c1', 'c2', 'c3', 'D'])['E'].apply(lambda x: x.max()).reset_index()['E']
es = t[t.notna()]
df[df['E'].isin(es)][cs]
Output:
     A     B     C  D    E
1  1.5  11.0  50.5  4  101
2  1.4  11.5  50.6  5  102
3  7.0  13.0  60.0  3  103
4  8.0  14.0  70.0  2  104
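If the thresholds have to hold exactly, one alternative is to compare every pair of rows directly. The following is only a sketch under my own assumptions: the helper name drop_similar_rows and its parameters are hypothetical, "too similar" is taken to mean an absolute difference at or below the threshold (as in the question's example), and the pairwise matrix needs memory proportional to the square of the row count, which should be manageable for a few thousand rows. It follows the same greedy idea as the loop in the question: visit rows from highest to lowest E and keep a row only if no already-kept row is similar to it.

```python
import numpy as np
import pandas as pd


def drop_similar_rows(df, thresholds, equal_cols, rank_col):
    """Hypothetical helper: greedily keep the highest-ranked of similar rows.

    thresholds: dict mapping column name -> maximum absolute difference
    equal_cols: columns whose values must match exactly
    rank_col:   column deciding which row survives (highest value wins)
    """
    # Visit rows from highest to lowest rank so that winners are seen first.
    ordered = df.sort_values(rank_col, ascending=False, kind="stable")

    n = len(ordered)
    similar = np.ones((n, n), dtype=bool)
    for col, limit in thresholds.items():
        values = ordered[col].to_numpy(dtype=float)
        # Broadcasting builds the full pairwise |x_i - x_j| matrix.
        similar &= np.abs(values[:, None] - values[None, :]) <= limit
    for col in equal_cols:
        values = ordered[col].to_numpy()
        similar &= values[:, None] == values[None, :]

    keep = np.zeros(n, dtype=bool)
    for i in range(n):
        # Keep row i only if no already-kept row is similar to it.
        if not (similar[i] & keep).any():
            keep[i] = True

    # Restore the original row order.
    return ordered[keep].sort_index()


df = pd.DataFrame({
    'A': [1, 1.5, 1.4, 7, 8],
    'B': [10, 11, 11.5, 13, 14],
    'C': [50, 50.5, 50.6, 60, 70],
    'D': [5, 4, 5, 3, 2],
    'E': [100, 101, 102, 103, 104],
})
result = drop_similar_rows(df, {'A': 1, 'B': 2, 'C': 1}, ['D'], 'E')
print(result)
```

On the sample data this drops row 0 and keeps rows 1 to 4, the same rows as the pd.cut version. The remaining Python loop runs once per row, with each step being a vectorized NumPy operation rather than a dataframe filter plus concat.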