Remove rows from pandas dataframe based on multiple columns with similar values
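I have a DataFrame with a few thousand rows and several columns.
I want to reduce its size by removing rows whose values in columns A, C, and D are too similar and whose column D values are equal. In other words, the difference between the values in each column is below a threshold, and that threshold may be different for each column. Among rows that are too similar, I want to keep the one with the highest value in column E.
I have code that fills a new DataFrame, checking for each row of the old DataFrame whether it is too similar to anything already in the new one.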
import pandas as pd

cols = [list-of-column-names]
df = pd.DataFrame(l, columns=cols)  # l is a list of thousands of lists with values to populate the dataframe
df.sort_values(by='E', ascending=False, inplace=True)  # Sort by the column whose highest value should be kept
new_df = pd.DataFrame(columns=cols)  # Create new dataframe
for i, line in df.iterrows():  # Iterate over old dataframe
    if len(
        new_df[
            (THRESHOLD_A1 < abs(1e6 * (new_df['A'] - line['A']) / new_df['A'])) & (
                abs(1e6 * (new_df['A'] - line['A']) / new_df['A']) < THRESHOLD_A2) &
            (new_df['E'] == line['E']) &
            (abs(new_df['C'] - line['C']) < THRESHOLD_C) &
            ((abs(new_df['D'] - line['D']) / new_df['D']) < THRESHOLD_D)
        ]
    ) == 0:  # If no similar row was found in the new dataframe, append this row to it
        new_df = pd.concat([new_df, pd.DataFrame([line])])
However, this code is too slow. Is there a better way to write it?
Example:
d = {
'A': [1, 1.5, 1.4, 7, 8],
'B': [10, 11, 11.5, 13, 14],
'C': [50, 50.5, 50.6, 60, 70],
'D': [5, 4, 5, 3, 2],
'E': [100, 101, 102, 103, 104]
}
df = pd.DataFrame(d)
THRESHOLD_A = 1
THRESHOLD_B = 2
THRESHOLD_C = 1
'''
# Values are too similar if the absolute difference between values of the same column is below the threshold
# Values in column D need to be the same
# If two rows are too similar, preserve the one with the highest value in column E
This would remove row 0
Rationale:
row 0 ['D'] == row 2 ['D']
abs(row 0 ['A'] - row 2 ['A']) == 0.4 <= THRESHOLD_A
abs(row 0 ['B'] - row 2 ['B']) == 1.5 <= THRESHOLD_B
abs(row 0 ['C'] - row 2 ['C']) == 0.6 <= THRESHOLD_C
row 2 has the highest value in column 'E' == 102.
'''
     A     B     C  D    E
0  1.0  10.0  50.0  5  100
1  1.5  11.0  50.5  4  101
2  1.4  11.5  50.6  5  102
3  7.0  13.0  60.0  3  103
4  8.0  14.0  70.0  2  104
Output:
     A     B     C  D    E
0  1.5  11.0  50.5  4  101
1  1.4  11.5  50.6  5  102
2  7.0  13.0  60.0  3  103
3  8.0  14.0  70.0  2  104
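The rule spelled out in the example comments can be written down fairly directly with NumPy arrays. The following is only a sketch of one possible reading: it visits rows from highest to lowest E, as the loop in the question does, and keeps a row unless an already-kept row is within every threshold and has the same D. The function name `drop_similar_rows` and its parameters are made up for illustration, and `<=` is used because the rationale above compares with `<=`.

```python
import numpy as np
import pandas as pd


def drop_similar_rows(df, thresholds, equal_col, rank_col):
    """Greedy filter (hypothetical helper): walk rows from highest to lowest
    `rank_col` and keep a row only if no already-kept row is within every
    threshold and shares the same `equal_col` value."""
    ordered = df.sort_values(rank_col, ascending=False)
    near_cols = list(thresholds)
    values = ordered[near_cols].to_numpy(dtype=float)
    limits = np.array([thresholds[c] for c in near_cols], dtype=float)
    equal_vals = ordered[equal_col].to_numpy()

    kept = []  # positions within `ordered` of the rows kept so far
    for pos in range(len(ordered)):
        if kept:
            # Compare this row against all kept rows in one vectorized step
            close = (np.abs(values[kept] - values[pos]) <= limits).all(axis=1)
            same = equal_vals[kept] == equal_vals[pos]
            if (close & same).any():
                continue  # too similar to a kept row with a higher E
        kept.append(pos)

    # Restore the original row order
    return ordered.iloc[kept].sort_index()


df = pd.DataFrame({
    'A': [1, 1.5, 1.4, 7, 8],
    'B': [10, 11, 11.5, 13, 14],
    'C': [50, 50.5, 50.6, 60, 70],
    'D': [5, 4, 5, 3, 2],
    'E': [100, 101, 102, 103, 104],
})
result = drop_similar_rows(df, {'A': 1, 'B': 2, 'C': 1}, equal_col='D', rank_col='E')
print(result.reset_index(drop=True))
```

This avoids growing a DataFrame with `pd.concat` inside the loop, but it still compares each row against every kept row, so the cost grows with the number of rows that survive.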
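One approach is to round the numbers to a fixed number of decimal places and then group by the rounded values. The drawback is that the thresholds cannot be set arbitrarily, because rounding only offers steps that are powers of ten. Demonstration: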
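import pandas as pd

df = pd.DataFrame({'A': [1.514, 1.54, 4.86], 'B': [1.51, 3.58, 4.01], 'C': [1.21, 8.52, 4.21], 'E': [5, 10, 20]})
# Largest E within each group of rows that share the same rounded A
es = df.groupby(df['A'].round(1))['E'].max()
df[df['E'].isin(es)]  # keep the rows whose E is one of those maxima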
Output:
      A     B     C   E
1  1.54  3.58  8.52  10
2  4.86  4.01  4.21  20
Here, because the first two rows have similar values in A, only the one with the larger E is kept. A better option is to use pd.cut:
# Number of bins = column range / threshold, so each bin is about one threshold wide
t1 = int((df['A'].max() - df['A'].min()) / THRESHOLD_A)
pd.cut(df['A'], bins=t1)
t2 = int((df['B'].max() - df['B'].min()) / THRESHOLD_B)
pd.cut(df['B'], bins=t2)
This gives you the groups. With your new sample data, column A is binned as:
0    (0.993, 2.0]
1    (0.993, 2.0]
2    (0.993, 2.0]
3      (6.0, 7.0]
4      (7.0, 8.0]
Name: A, dtype: category
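One thing to keep in mind with any binning approach is that bins have fixed edges, so two values closer together than the threshold can still fall on opposite sides of an edge. The small sketch below uses made-up values and explicit bin edges purely to illustrate this; it is not part of the sample data.

```python
import pandas as pd

# Hypothetical values: 1.9 and 2.1 are only 0.2 apart, 2.1 and 2.9 are 0.8 apart
values = pd.Series([1.9, 2.1, 2.9])

# Explicit edges give the bins (1, 2] and (2, 3], each one unit wide
bins = pd.cut(values, bins=[1, 2, 3])
print(bins)

# Integer code of the bin each value landed in
codes = bins.cat.codes.tolist()
print(codes)
```

The close pair (1.9, 2.1) ends up in different bins while the more distant pair (2.1, 2.9) shares one, so binning approximates the threshold rule rather than implementing it exactly.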
A demonstration on your sample:
import pandas as pd

THRESHOLD_A = 1
THRESHOLD_B = 2
THRESHOLD_C = 1
d = {'A': [1, 1.5, 1.4, 7, 8],
     'B': [10, 11, 11.5, 13, 14],
     'C': [50, 50.5, 50.6, 60, 70],
     'D': [5, 4, 5, 3, 2],
     'E': [100, 101, 102, 103, 104]
     }
df = pd.DataFrame(d)
cs = df.columns  # original columns, captured before the bin columns are added
# Bin each column into intervals about one threshold wide
t1 = int((df['A'].max() - df['A'].min()) / THRESHOLD_A)
c1 = pd.cut(df['A'], bins=t1)
t2 = int((df['B'].max() - df['B'].min()) / THRESHOLD_B)
c2 = pd.cut(df['B'], bins=t2)
t3 = int((df['C'].max() - df['C'].min()) / THRESHOLD_C)
c3 = pd.cut(df['C'], bins=t3)
df['c1'] = c1
df['c2'] = c2
df['c3'] = c3
# Largest E per (A bin, B bin, C bin, D) group; observed=True skips empty bin combinations
es = df.groupby(['c1', 'c2', 'c3', 'D'], observed=True)['E'].max()
df[df['E'].isin(es)][cs]
Output:
     A     B     C  D    E
1  1.5  11.0  50.5  4  101
2  1.4  11.5  50.6  5  102
3  7.0  13.0  60.0  3  103
4  8.0  14.0  70.0  2  104
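Selecting rows with `isin` on the E values works here because every E is unique. If the same E value can appear in more than one group, a row that is not its group's maximum may be kept just because its E equals another group's maximum. A possible alternative, sketched below on made-up data, is to select by row label with `idxmax`. The `group` column is a hypothetical stand-in for the combination of bin columns and D.

```python
import pandas as pd

# Hypothetical data: 'group' stands in for the (c1, c2, c3, D) combination
df = pd.DataFrame({
    'group': ['x', 'x', 'y', 'y'],
    'E': [10, 20, 10, 5],
})

# Selecting by value: row 0 is kept because its E (10) equals group y's maximum
by_value = df[df['E'].isin(df.groupby('group')['E'].max())]

# Selecting by label: exactly one row per group, the one holding the maximum E
by_label = df.loc[df.groupby('group')['E'].idxmax()]

print(by_value)
print(by_label)
```

Calling `reset_index(drop=True)` on the result renumbers the rows from 0, which matches the index shown in the expected output of the question.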