Efficient Cartesian product algo Pandas DFs / partial match between columns
I have two DataFrames:
df1
name
xyz limited
abc private
lmn limited
pqrlimited
abc def xyz limited
abc private limited
df2
flag tag
E private
A limited
and the desired output is:
Output:
name                 flag  tag
xyz limited          A     limited
abc private          E     private
lmn limited          A     limited
pqrlimited           A     limited
abc def xyz limited  A     limited
abc private limited  A     limited
abc private limited  E     private
My code:
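import pandas as pd

# Add a constant key to both frames so the merge yields the full Cartesian product
df1['tmp'] = 1
df2['tmp'] = 1
df3 = pd.merge(df1, df2, on=['tmp'])
df3 = df3.drop('tmp', axis=1)
# Keep only the pairs where the tag occurs as a substring of the name
df3 = df3[df3.apply(lambda x: x['tag'] in x['name'], axis=1)]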
In practice, however, both DataFrames contain millions of records. Can someone suggest the most efficient way to solve this?
Use split together with merge:
# works when the tag is the second space-separated word of the name
df1['tag'] = df1['name'].str.split(' ', expand=True)[1]
df1.merge(df2)
# or
df1['flag'] = df1['tag'].map(df2.set_index('tag')['flag'])
# or, if the strings are not separated by spaces:
df1['tag'] = df1['name'].str.findall('|'.join(set(df2['tag'].tolist()))).str[0]
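One caveat with the '|'.join(...) pattern: the joined string is interpreted as a regular expression, so a tag containing characters such as + or . would change the meaning of the pattern or raise an error. A minimal sketch of a safer way to build it, assuming the tags are meant as literal substrings. The helper name build_tag_pattern and the sample tag "c++" are hypothetical and not part of the question's data:

```python
import re
import pandas as pd

def build_tag_pattern(tags):
    """Hypothetical helper: join literal tags into one regex alternation."""
    # Longest tags first, so a longer tag wins over a shorter tag that is its prefix.
    unique = sorted(set(tags), key=lambda t: (-len(t), t))
    # re.escape makes every tag match literally.
    return "|".join(re.escape(t) for t in unique)

df1 = pd.DataFrame({"name": ["xyz limited", "pqrlimited", "foo c++ bar", "no match here"]})
tags = ["limited", "c++", "private"]

pattern = build_tag_pattern(tags)
# First matching tag per row; rows without any match get NaN.
df1["tag"] = df1["name"].str.findall(pattern).str[0]
print(df1)
```

Without the escaping step, the tag "c++" would make the pattern invalid.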
Updated solution:
df1 = (df1.reset_index()
          .merge(df1.name.str.findall('|'.join(set(df2['tag'].tolist())))
                    .explode()
                    .reset_index(name='tag'),
                 on='index')
          .drop('index', axis=1))
df = df1.merge(df2)
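To check this idea against the sample data from the question, here is a self-contained sketch of the same findall/explode approach. It joins on the row index instead of going through reset_index, and it assumes the tags contain no regex metacharacters. The names pattern, tags and result are illustrative:

```python
import pandas as pd

df1 = pd.DataFrame({"name": [
    "xyz limited", "abc private", "lmn limited", "pqrlimited",
    "abc def xyz limited", "abc private limited",
]})
df2 = pd.DataFrame({"flag": ["E", "A"], "tag": ["private", "limited"]})

pattern = "|".join(df2["tag"])              # "private|limited"
tags = (df1["name"].str.findall(pattern)    # list of matched tags per row
        .explode()                          # one row per match, index repeated
        .rename("tag"))
result = (df1.join(tags)                    # align on the original row index
          .dropna(subset=["tag"])           # drop names with no matching tag
          .merge(df2, on="tag")             # hash join instead of a cross join
          [["name", "flag", "tag"]])
print(result)
```

Each name is scanned once and the final merge is an ordinary equality join on tag, so the full Cartesian product is never built. On the sample data this gives seven rows, with "abc private limited" appearing once per matching tag.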
You can do it like this:
regx = '|'.join(df2['tag'])
df1['tag'] = df1['name'].str.extract(f'({regx})')
df1['flag'] = df1['tag'].map(df2.set_index('tag')['flag'])
print(df1)
Output:
                  name      tag flag
0          xyz limited  limited    A
1          abc private  private    E
2          lmn limited  limited    A
3           pqrlimited  limited    A
4  abc def xyz limited  limited    A
5  abc private limited  private    E
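Note that str.extract keeps only the first match per row, so "abc private limited" is tagged private only, while the desired output in the question also pairs it with limited. If every match is needed, str.extractall could be used instead. A minimal sketch on the question's sample data, assuming the tags are plain literals; the names all_tags and df_all are illustrative:

```python
import pandas as pd

df1 = pd.DataFrame({"name": [
    "xyz limited", "abc private", "lmn limited", "pqrlimited",
    "abc def xyz limited", "abc private limited",
]})
df2 = pd.DataFrame({"flag": ["E", "A"], "tag": ["private", "limited"]})

regx = "|".join(df2["tag"])
# extractall returns one row per match, indexed by (original row, match number).
all_tags = (df1["name"].str.extractall(f"({regx})")[0]
            .rename("tag")
            .droplevel("match"))            # keep only the original row index
# The inner join drops names without any tag; the merge then adds the flag.
df_all = df1.join(all_tags, how="inner").merge(df2, on="tag")
print(df_all)
```

With extractall, the last name yields two rows, one for private and one for limited.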