Matching multiple strings from rows in one dataframe to rows in another

I have two dataframes, A and B. A is the smaller one with 500 rows, and B is the larger one with 20000 rows. A's columns are:

A.columns = ['name','company','model','family']

and B's columns are:

B.columns = ["title", "price"]

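The title column in B is a big messy string, but it does contain the strings from 3 columns of A, namely company, model and family (ignore the 'name' column, since the name in A is itself a combination of company, model and family). I need to match each row in A to a row in B. Here is my solution: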

out = pd.DataFrame(columns=['name', 'company', 'model', 'family', 'title', 'price'])

for index, row in A.iterrows():
    lst=[A.loc[index,'family'], A.loc[index,'model'], A.loc[index,'company']]
    for i, r in B.iterrows():
        if all(w in B.loc[i,'title'] for w in lst):        
            out.loc[index,'name']=A.loc[index,'name']
            out.loc[index,'company']=A.loc[index,'company']
            out.loc[index,'model']=A.loc[index,'model']
            out.loc[index,'family']=A.loc[index,'family']

            out.loc[index,'title']=B.loc[i,'title']
            out.loc[index,'price']=B.loc[i,'price']
            break

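This gets the job done, but very inefficiently, and it takes a long time. I know this is a "record linkage" problem and that people are researching its accuracy and speed, but is there a faster, more efficient way to do this in Pandas? It would be faster to check only one or two of the items in lst against the title, but my concern is that this would reduce accuracy.

In terms of accuracy, I would rather get fewer matches than wrong matches.

Also, A.dtypes and B.dtypes show that the columns of both dataframes are objects: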

title           object
price           object
dtype: object

I appreciate any comments. Thanks.

*********UPDATE***********

Parts of the two files can be found here: A B

I did some cleaning on them:

import pandas as pd
import numpy as np
from scipy import stats
import matplotlib.colors as mcol
import matplotlib.cm as cm
import matplotlib.pyplot as plt
import math

A = pd.read_csv('A.txt', delimiter=',', header=None) 
A.columns = ['product name','manufacturer','model','family','announced date']

for index, row in A.iterrows():    
    A.loc[index, "product name"] = A.loc[index, "product name"].split('"')[3]
    A.loc[index, "manufacturer"] = A.loc[index, "manufacturer"].split('"')[1]
    A.loc[index, "model"] = A.loc[index, "model"].split('"')[1]
    if 'family' in A.loc[index, "family"]:
        A.loc[index, "family"] = A.loc[index, "family"].split('"')[1]
    if 'announced' in A.loc[index, "family"]:
        A.loc[index, "announced date"] = A.loc[index, "family"]
        A.loc[index, "family"] = ''
    A.loc[index, "announced date"] = A.loc[index, "announced date"].split('"')[1]

A.columns=['product name','manufacturer','model','family','announced date']
A.reset_index()

# on_bad_lines='skip' (pandas >= 1.3) replaces the removed error_bad_lines/warn_bad_lines flags
B = pd.read_csv('B.txt', on_bad_lines='skip', header=None)

B.columns = ["title", "manufacturer", "currency", "price"]
pd.options.display.max_colwidth=200

for index, row in B.iterrows():
    B.loc[index,'manufacturer']=B.loc[index,'manufacturer'].split('"')[1]
    B.loc[index,'currency']=B.loc[index,'currency'].split('"')[1]
    B.loc[index,'price']=B.loc[index,'price'].split('"')[1]
    B.loc[index,'title']=B.loc[index,'title'].split('"')[3]
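
The row-by-row loops above could probably be replaced with vectorized string methods, which avoid a Python-level iteration per row. A minimal sketch on made-up values (B_raw and its contents are hypothetical stand-ins shaped like the fields being split above, not the real files):

```python
import pandas as pd

# Hypothetical raw values shaped like the fields the loops above pull apart
B_raw = pd.DataFrame({
    'title': ['{"title":"Sony DSC-W310 camera"', '{"title":"Canon PowerShot"'],
    'price': ['price:"35.99"}', 'price:"120.00"}'],
})

B_clean = pd.DataFrame({
    # .str.split('"') splits every row at once; .str[n] picks the n-th piece
    'title': B_raw['title'].str.split('"').str[3],
    # converting to numbers also gets rid of the object dtype on price
    'price': pd.to_numeric(B_raw['price'].str.split('"').str[1]),
})
```

The same pattern would apply column by column to A, with the index passed to .str[...] adjusted to whichever piece each loop selects.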

Then comes Andrew's method, as suggested in the answer:

def match_strs(row):
    return np.where(B.title.str.contains(row['manufacturer']) & \
                    B.title.str.contains(row['family']) & \
                    B.title.str.contains(row['model']))[0][0]

A['merge_idx'] = A.apply(match_strs, axis='columns')

(A.merge(B, left_on='merge_idx', right_on='index', right_index=True, how='right')
  .drop('merge_idx', 1)
  .dropna())

As I said, some complications came up that I could not figure out. Thanks a lot for your help.

Here is some sample data to work with:

import numpy as np
import pandas as pd

# make A df
manufacturer = ['A','B','C']
model = ['foo','bar','baz']
family = ['X','Y','Z']
name = ['{}_{}_{}'.format(manufacturer[i],model[i],family[i]) for i in range(len(manufacturer))]
A = pd.DataFrame({'name':name,'manufacturer': manufacturer,'model':model,'family':family})

# A
  manufacturer family model     name
0            A      X   foo  A_foo_X
1            B      Y   bar  B_bar_Y
2            C      Z   baz  C_baz_Z

# make B df
title = ['blahblahblah']
title.extend( ['{}_{}'.format(n, 'blahblahblah') for n in name] )
B = pd.DataFrame({'title':title,'price':np.random.randint(1,100,4)})

# B
   price                 title
0     62          blahblahblah
1      7  A_foo_X_blahblahblah
2     92  B_bar_Y_blahblahblah
3     24  C_baz_Z_blahblahblah

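We can create a function that, based on your matching criteria, finds the index of the matching row in B for each row of A, and store the result in a new column: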

def match_strs(row):
    match_result = (np.where(B.title.str.contains(row['manufacturer']) & \
                             B.title.str.contains(row['family']) & \
                             B.title.str.contains(row['model'])))
    if not len(match_result[0]):
        return None
    return match_result[0][0]

A['merge_idx'] = A.apply(match_strs, axis='columns')
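
One caveat: str.contains treats its pattern as a regular expression by default, so a model such as 'A+1' would not match itself literally. Below is a hedged variant of the function above that passes regex=False for plain substring tests. The sample rows and the name match_strs_literal are made up for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; the '+' in the first model would act as a
# regex quantifier if the pattern were not treated literally.
A = pd.DataFrame({
    'manufacturer': ['Acme', 'Bolt'],
    'family': ['X', 'Y'],
    'model': ['A+1', 'bar'],
})
B = pd.DataFrame({
    'title': ['unrelated listing', 'Acme X A+1 camera', 'Bolt Y bar lens'],
    'price': [10, 20, 30],
})

def match_strs_literal(row):
    # regex=False makes str.contains do a plain substring test
    mask = (B['title'].str.contains(row['manufacturer'], regex=False)
            & B['title'].str.contains(row['family'], regex=False)
            & B['title'].str.contains(row['model'], regex=False))
    hits = np.flatnonzero(mask.to_numpy())
    # position of the first matching row in B, or None when nothing matches
    return hits[0] if len(hits) else None

A['merge_idx'] = A.apply(match_strs_literal, axis='columns')
```

The resulting merge_idx column can be used in the merge in the same way as the one produced by match_strs.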

Then merge A and B:

(A.merge(B, left_on='merge_idx', right_index=True, how='right')
  .drop(columns='merge_idx')
  .dropna())

Output (the prices are generated randomly, so they differ from the B printed above):

  manufacturer family model     name  price                 title
0            A      X   foo  A_foo_X     23  A_foo_X_blahblahblah
1            B      Y   bar  B_bar_Y     14  B_bar_Y_blahblahblah
2            C      Z   baz  C_baz_Z     19  C_baz_Z_blahblahblah

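Is that what you were looking for?

Note that if you want to keep the rows of B even when there is no match in A, simply remove .dropna() from the end of the merge.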
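
Since fewer matches are preferred over wrong ones, one possible refinement is to accept a title only when exactly one row of A matches it, and to discard ambiguous titles. This is a sketch under that assumption, not a full record-linkage solution. The sample data, the helper title_mask and the column name A_idx are made up for illustration:

```python
import pandas as pd

# Hypothetical sample data: the last title contains the strings of two A rows
A = pd.DataFrame({
    'name': ['A_foo_X', 'B_bar_Y', 'A_foo_Z'],
    'manufacturer': ['A', 'B', 'A'],
    'model': ['foo', 'bar', 'foo'],
    'family': ['X', 'Y', 'Z'],
})
B = pd.DataFrame({
    'title': ['blahblahblah', 'A_foo_X_blah', 'B_bar_Y_blah', 'A_foo_X_Z_blah'],
    'price': [5, 7, 92, 24],
})

keys = ['manufacturer', 'family', 'model']

def title_mask(row):
    # True for every title in B that contains all three strings of this A row
    mask = pd.Series(True, index=B.index)
    for k in keys:
        mask &= B['title'].str.contains(row[k], regex=False)
    return mask

# One boolean column per row of A, one row per title in B
hits = pd.DataFrame({idx: title_mask(row) for idx, row in A.iterrows()})

# Keep only titles matched by exactly one row of A
unambiguous = hits.sum(axis=1) == 1
B_matched = B[unambiguous].copy()
B_matched['A_idx'] = hits[unambiguous].astype(int).idxmax(axis=1)

result = (B_matched.merge(A, left_on='A_idx', right_index=True)
                   .drop(columns='A_idx'))
```

Here the title 'A_foo_X_Z_blah' is dropped because it matches two rows of A. At the sizes in the question, hits would be a 20000 x 500 boolean frame, which should fit in memory comfortably, but that is worth checking on the real data.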
