In pandas, check if a master string contains a string from a list; if it does, remove the substring from the master string and add it to a new column
I have two DataFrames:
df1=
   A
0  Black Prada zebra leather Large
1  green Gucci striped Canvas small
2  blue Prada Monogram calf leather XL
df2=
   color   pattern    material      size
0  black   zebra      leather       small
1  green   striped    canvas        xl
2  yellow  checkered  calf leather  medium
3  orange  monogram
4  white   plain
5          pinstripe
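I want to compare the columns of df2 against df1 (allowing for inconsistent case and whitespace). Where there is a match, put the matched term in a new column in df1 and remove it from A. It should be an exact match, so that "calf leather" is not mistakenly matched as "leather". The result would leave only the unmatched substrings in A: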
df3=
   A            color  pattern   material      size
0  Prada Large  black  zebra     leather       NaN
1  Gucci        green  striped   canvas        small
2  Prada        blue   Monogram  calf leather  XL
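I have tried using for loops, but my dataset is large and I feel that this does not make full use of pandas. I also tried contains and isin without success. Is using .extract and converting the df2 columns to regex the only solution? Thanks!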
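Update
It sounds like you may want to rank how well the strings from df1 (which I now call search below) match the df2 columns.
Here it finds, for each of your search strings, the greatest percentage of its words that match the words in a row of the df2 columns. If that percentage meets some required threshold, the string is removed.
I have tested it and it works, but you may need to experiment with the regular expression a little.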
import re

import pandas


def perc_match(src, s):
    '''Return the fraction of words in s that are found in src'''
    # http://stackoverflow.com/a/26985301/943773
    words = s.split()
    # Whole-word alternation; re.X ignores the spaces around '|'
    pattern = ' | '.join(r'\b{}\b'.format(re.escape(w)) for w in words)
    r = re.compile(pattern, flags=re.I | re.X)
    # Divide by the number of words in s, not the character length of src
    return len(r.findall(src)) / len(words)


search = ['Black Prada zebra leather Large',
          'green Gucci striped Canvas small',
          'blue Prada Monogram calf leather XL']

d2 = {'color': ['black', 'green', 'yellow', 'orange', 'white', ''],
      'pattern': ['zebra', 'striped', 'checkered', 'monogram', 'plain',
                  'pinstripe'],
      'material': ['leather', 'canvas', 'calf leather', '', '', ''],
      'size': ['small', 'xl', 'medium', '', '', '']}
df2 = pandas.DataFrame(d2)

# Strip whitespace and make all lower case
search = [s.strip().lower() for s in search]
df2 = df2.apply(lambda col: col.str.strip().str.lower())

# Combine all columns to single string for each row
df2['full_str'] = df2.apply(lambda row: ' '.join(row), axis=1)

# Minimum fraction of search words that must match (tune for your data)
min_thresh = 0.5

# Calculate the percentage match for each row of dataframe
rm_ind = list()
for i, s in enumerate(search):
    # If you want you could save these `perc_matches` for later
    perc_matches = df2['full_str'].apply(perc_match, args=(s,))
    # Mark for removal if above threshold
    if perc_matches.max() > min_thresh:
        rm_ind.append(i)

# Remove indices from `search`, last first, so that earlier deletions
# do not shift the positions still to be removed
for i in reversed(rm_ind):
    del search[i]
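The code above flags whole search strings but does not build the per-attribute columns the question asks for. As a rough sketch of the .extract route mentioned in the question, each df2 column can be turned into a case-insensitive, whole-word regex with longer terms listed first, so that "calf leather" is tried before "leather". The helper name extract_terms and its text_col parameter are hypothetical, and lowercasing the extracted values is an assumption about the desired output, not something the question specifies.

```python
import re

import pandas as pd

df1 = pd.DataFrame({'A': ['Black Prada zebra leather Large',
                          'green Gucci striped Canvas small',
                          'blue Prada Monogram calf leather XL']})
df2 = pd.DataFrame({
    'color': ['black', 'green', 'yellow', 'orange', 'white', ''],
    'pattern': ['zebra', 'striped', 'checkered', 'monogram', 'plain',
                'pinstripe'],
    'material': ['leather', 'canvas', 'calf leather', '', '', ''],
    'size': ['small', 'xl', 'medium', '', '', ''],
})


def extract_terms(df1, df2, text_col='A'):
    """Hypothetical helper: move terms listed in df2 out of df1[text_col]."""
    out = df1.copy()
    remaining = out[text_col].str.strip()
    for col in df2.columns:
        # Keep non-empty terms, longest first, so multi-word terms win
        terms = {t.strip() for t in df2[col] if isinstance(t, str) and t.strip()}
        terms = sorted(terms, key=len, reverse=True)
        if not terms:
            continue
        # \b on both sides restricts the match to whole words
        pattern = r'\b(' + '|'.join(re.escape(t) for t in terms) + r')\b'
        # First match per row becomes the new column (NaN when nothing matches)
        found = remaining.str.extract(pattern, flags=re.IGNORECASE, expand=False)
        out[col] = found.str.lower()
        # Remove that first match from the text still to be searched
        remaining = remaining.str.replace(pattern, ' ', n=1,
                                          flags=re.IGNORECASE, regex=True)
    # Collapse the whitespace left behind by the removals
    out[text_col] = remaining.str.split().str.join(' ')
    return out


result = extract_terms(df1, df2)
print(result)
```

With the sample data, "blue" stays in A and color is NaN for the last row, because "blue" is not listed in df2. Only the first match per column is extracted for each row; rows that can contain several terms from the same column would need str.extractall or a similar variation.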