Matching multiple strings from rows in one dataframe to rows in another
I have two dataframes, A and B. A is the smaller one with 500 rows, and B is the larger one with 20,000 rows. The columns of A are:
A.columns = ['name','company','model','family']
and the columns of B are:
B.columns = ["title", "price"]
The title column in B is one big messy string, but it does contain the strings from three columns of A: company, model and family (ignore the 'name' column, since the name in A is itself a combination of company, model and family). I need to match each row in A to a single row in B. Here is my solution:
out = pd.DataFrame(columns=["name", 'company', 'model', 'family', 'title', 'price'])
for index, row in A.iterrows():
    lst = [A.loc[index, 'family'], A.loc[index, 'model'], A.loc[index, 'company']]
    for i, r in B.iterrows():
        # keep the first title in B that contains all three strings
        if all(w in B.loc[i, 'title'] for w in lst):
            out.loc[index, 'name'] = A.loc[index, 'name']
            out.loc[index, 'company'] = A.loc[index, 'company']
            out.loc[index, 'model'] = A.loc[index, 'model']
            out.loc[index, 'family'] = A.loc[index, 'family']
            out.loc[index, 'title'] = B.loc[i, 'title']
            out.loc[index, 'price'] = B.loc[i, 'price']
            break
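This gets the job done, but very inefficiently, and it takes a long time. I know this is a "record linkage" problem and that people are researching its accuracy and speed, but is there a faster, more efficient way to do this in Pandas? It would be faster to check only one or two of the items in the title, but my concern is that this would reduce accuracy.
In terms of accuracy, I would rather get fewer matches than wrong matches.
Also, A.dtypes and B.dtypes show that the columns of both dataframes are of type object: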
title object
price object
dtype: object
I appreciate any comments. Thanks.
*********UPDATE***********
I did some cleaning on the two files:
import pandas as pd
import numpy as np
from scipy import stats
import matplotlib.colors as mcol
import matplotlib.cm as cm
import matplotlib.pyplot as plt
import math
A = pd.read_csv('A.txt', delimiter=',', header=None)
A.columns = ['product name','manufacturer','model','family','announced date']
for index, row in A.iterrows():
    A.loc[index, "product name"] = A.loc[index, "product name"].split('"')[3]
    A.loc[index, "manufacturer"] = A.loc[index, "manufacturer"].split('"')[1]
    A.loc[index, "model"] = A.loc[index, "model"].split('"')[1]
    if 'family' in A.loc[index, "family"]:
        A.loc[index, "family"] = A.loc[index, "family"].split('"')[1]
    if 'announced' in A.loc[index, "family"]:
        A.loc[index, "announced date"] = A.loc[index, "family"]
        A.loc[index, "family"] = ''
    A.loc[index, "announced date"] = A.loc[index, "announced date"].split('"')[1]
A.columns=['product name','manufacturer','model','family','announced date']
A.reset_index()
# Note: pandas 1.3+ replaces error_bad_lines/warn_bad_lines with on_bad_lines='skip'
B = pd.read_csv('B.txt', error_bad_lines=False, warn_bad_lines=False, header=None)
B.columns = ["title", "manufacturer", "currency", "price"]
pd.options.display.max_colwidth=200
for index, row in B.iterrows():
    B.loc[index, 'manufacturer'] = B.loc[index, 'manufacturer'].split('"')[1]
    B.loc[index, 'currency'] = B.loc[index, 'currency'].split('"')[1]
    B.loc[index, 'price'] = B.loc[index, 'price'].split('"')[1]
    B.loc[index, 'title'] = B.loc[index, 'title'].split('"')[3]
And then Andrew's method, as suggested in the answer:
def match_strs(row):
    return np.where(B.title.str.contains(row['manufacturer']) & \
                    B.title.str.contains(row['family']) & \
                    B.title.str.contains(row['model']))[0][0]

A['merge_idx'] = A.apply(match_strs, axis='columns')
(A.merge(B, left_on='merge_idx', right_on='index', right_index=True, how='right')
  .drop('merge_idx', 1)
  .dropna())
As I said, some complications came up that I couldn't figure out. Thank you very much for your help.
Here is some sample data to work with:
import numpy as np
import pandas as pd
# make A df
manufacturer = ['A','B','C']
model = ['foo','bar','baz']
family = ['X','Y','Z']
name = ['{}_{}_{}'.format(manufacturer[i], model[i], family[i]) for i in range(len(manufacturer))]
A = pd.DataFrame({'name':name,'manufacturer': manufacturer,'model':model,'family':family})
# A
  manufacturer family model     name
0            A      X   foo  A_foo_X
1            B      Y   bar  B_bar_Y
2            C      Z   baz  C_baz_Z
# make B df
title = ['blahblahblah']
title.extend( ['{}_{}'.format(n, 'blahblahblah') for n in name] )
B = pd.DataFrame({'title':title,'price':np.random.randint(1,100,4)})
# B
   price                 title
0     62          blahblahblah
1      7  A_foo_X_blahblahblah
2     92  B_bar_Y_blahblahblah
3     24  C_baz_Z_blahblahblah
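We can write a function that, based on your matching criteria, finds for each row of A the index of the matching row in B, and store the result in a new column: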
def match_strs(row):
    # positions of the titles in B that contain all three strings
    match_result = np.where(B.title.str.contains(row['manufacturer']) &
                            B.title.str.contains(row['family']) &
                            B.title.str.contains(row['model']))
    if not len(match_result[0]):
        return None  # no title matched this row of A
    return match_result[0][0]  # keep only the first match

A['merge_idx'] = A.apply(match_strs, axis='columns')
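One caveat worth keeping in mind: `str.contains` treats its pattern as a regular expression by default, so a model such as `b.z` would also match `buz`. Since you prefer fewer matches over wrong ones, a literal substring test may be safer. The following is only a sketch under that assumption; the helper name `match_literal` and the tiny sample frames are made up for illustration.

```python
import pandas as pd

# Hypothetical sample data: the model "b.z" contains a regex wildcard.
A = pd.DataFrame({
    "manufacturer": ["A", "B", "C"],
    "model": ["foo", "bar", "b.z"],
    "family": ["X", "Y", "Z"],
})
B = pd.DataFrame({
    "title": ["blahblahblah", "A_foo_X_blah", "C_buz_Z_blah", "C_b.z_Z_blah"],
    "price": [62, 7, 92, 24],
})

def match_literal(row):
    """Position of the first title in B containing all three fields literally, else None."""
    mask = pd.Series(True, index=B.index)
    for col in ("manufacturer", "family", "model"):
        # regex=False makes this a plain substring test
        mask &= B["title"].str.contains(row[col], regex=False)
    hits = mask.to_numpy().nonzero()[0]
    return int(hits[0]) if len(hits) else None

merge_idx = A.apply(match_literal, axis="columns")
```

With the default regex matching, the third row would have picked up `C_buz_Z_blah` first; with `regex=False` it only accepts the title that really contains `b.z`.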
Then merge A and B:
(A.merge(B, left_on='merge_idx', right_index=True, how='right')
  .drop(columns='merge_idx')
  .dropna())
Output:
  manufacturer family model     name  price                 title
0            A      X   foo  A_foo_X     23  A_foo_X_blahblahblah
1            B      Y   bar  B_bar_Y     14  B_bar_Y_blahblahblah
2            C      Z   baz  C_baz_Z     19  C_baz_Z_blahblahblah
Is that what you were looking for?
Note that if you want to keep the rows in B even when there is no match in A, just remove .dropna() from the end of the merge.
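To make the effect of `.dropna()` concrete, here is a small sketch with fixed prices instead of random ones. It assumes the matching step has already produced a `merge_idx` column holding row positions in B; the frames are invented for the example.

```python
import pandas as pd

# Hypothetical result of the matching step: merge_idx points at rows of B.
A = pd.DataFrame({
    "name": ["A_foo_X", "B_bar_Y"],
    "merge_idx": [1, 2],
})
B = pd.DataFrame({
    "title": ["blahblahblah", "A_foo_X_blahblahblah", "B_bar_Y_blahblahblah"],
    "price": [62, 7, 92],
})

# Right join: every row of B survives, matched or not.
merged = (A.merge(B, left_on="merge_idx", right_index=True, how="right")
            .drop(columns="merge_idx"))

# Dropping rows with missing values removes the B rows that had no partner in A.
matched_only = merged.dropna()
```

Here `merged` keeps the unmatched `blahblahblah` row with a missing `name`, while `matched_only` contains just the two linked rows.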