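In pandas, check whether a main string contains strings from a list and, if it does, remove the substring from the main string and add it to a new column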

Question

I have two DataFrames:

df1=
    A    
0   Black Prada zebra leather Large   
1   green Gucci striped Canvas small   
2   blue Prada Monogram calf leather XL

df2=
    color    pattern   material     size
0   black    zebra     leather      small
1   green    striped   canvas       xl
2   yellow   checkered calf leather medium
3   orange   monogram
4   white    plain
5            pinstripe

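I want to compare the columns in df2 against df1 (allowing for inconsistent case and whitespace) and, where there is a match, put the match in a new column in df1 and remove it from A. It should be an exact match, so that "calf leather" is not wrongly matched as "leather". The result in A would then be only the remaining unmatched substrings: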

df3=
    A            color    pattern   material     size
0   Prada Large  black    zebra     leather      NaN
1   Gucci        green    striped   canvas       small
2   Prada        blue     Monogram  calf leather XL

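I have tried a for loop, but my dataset is large and I feel that doesn't take full advantage of pandas. I also tried contains and isin without success. Is using .extract and converting the df2 columns to regex the only solution? Thanks!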

Answer 1

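Update

It sounds like you may want to rank how well each string from df1 matches the rows of df2 (I call the df1 strings `search` below).

Here, for each `search` string, it computes the percentage of its words that match words in each row of df2 and takes the maximum over the rows. If that maximum exceeds a required threshold, the string is removed from `search`.

I have tested it and it works, but you may need to play with the regex a bit.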
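To make the score concrete before the full script, here is a minimal sketch of the same idea using plain word sets instead of a regex. The helper name `word_overlap` is hypothetical and is only for illustration; it is a simplification, not the function used below.

```python
def word_overlap(search_str, row_str):
    """Fraction of the words in search_str that also occur in row_str."""
    search_words = search_str.lower().split()
    row_words = set(row_str.lower().split())
    if not search_words:
        return 0.0
    # Count the search words that appear as whole words in the row
    hits = sum(1 for w in search_words if w in row_words)
    return hits / len(search_words)


# First df1 string against the first row of df2 joined into one string:
# 'black', 'zebra' and 'leather' match, 'prada' and 'large' do not
score = word_overlap('Black Prada zebra leather Large',
                     'black zebra leather small')
print(score)  # 0.6
```

With a threshold of 0.1, as in the script below, this string would be marked for removal.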

import pandas

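def perc_match(src, s):
    '''Return the fraction of the words in s that are found in src'''
    # http://stackoverflow.com/a/26985301/943773
    import re
    words = set(s.lower().split())
    if not words:
        return 0.0
    # Whole-word alternatives; re.X ignores the spaces around each `|`
    pattern = ' | '.join([r'\b{}\b'.format(re.escape(x)) for x in words])
    r = re.compile(pattern, flags=re.I | re.X)

    # Count distinct matched words and divide by the number of words in s
    # (not by the character length of src)
    found = set(m.lower() for m in r.findall(src))
    return len(found) / len(words)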


search = ['Black Prada zebra leather Large',
          'green Gucci striped Canvas small',
          'blue Prada Monogram calf leather XL']

d2 = {'color':['black', 'green', 'yellow', 'orange', 'white',''],
      'pattern':['zebra', 'striped', 'checkered', 'monogram', 'plain',
                 'pinstripe'],
      'material':['leather', 'canvas', 'calf leather','','',''],
      'size':['small', 'xl', 'medium','','','']}

df2 = pandas.DataFrame(d2)

# Strip whitespace and make all lower case
strip_lower = lambda x: x.strip().lower()
search = list(map(strip_lower, search))
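# Column-wise string methods; avoids `applymap`, which newer pandas deprecates
df2 = df2.apply(lambda col: col.str.strip().str.lower())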

# Combine all columns to single string for each row
df2['full_str'] = df2.apply(lambda row: ' '.join(row), axis=1)

# Min percent matching
min_thresh = 0.1

# Calculate the percentage match for each row of dataframe
rm_ind = list()
for i in range(len(search)):
    s = search[i]
    # If you want you could save these `perc_matches` for later
    perc_matches = df2['full_str'].apply(perc_match, args=(s,))
    # Mark for removal if above threshold
    if perc_matches.max() > min_thresh:
        rm_ind.append(i)

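# Remove indices from `search`, highest index first so that earlier
# deletions do not shift the positions of the remaining ones
for i in sorted(rm_ind, reverse=True):
    del search[i]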
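
The question also asks about `.extract` with a regex built from the df2 columns. A minimal sketch of that approach follows; it is not part of the answer above, and it assumes that each df2 column holds plain strings, that empty cells are empty strings, and that at most one term per column should be pulled out of each row. The names `df3`, `terms` and `pattern` are illustrative.

```python
import re
import pandas as pd

df1 = pd.DataFrame({'A': ['Black Prada zebra leather Large',
                          'green Gucci striped Canvas small',
                          'blue Prada Monogram calf leather XL']})
df2 = pd.DataFrame({
    'color': ['black', 'green', 'yellow', 'orange', 'white', ''],
    'pattern': ['zebra', 'striped', 'checkered', 'monogram', 'plain',
                'pinstripe'],
    'material': ['leather', 'canvas', 'calf leather', '', '', ''],
    'size': ['small', 'xl', 'medium', '', '', ''],
})

df3 = df1.copy()
for col in df2.columns:
    # Drop empty cells and surrounding whitespace
    terms = [t.strip() for t in df2[col] if t.strip()]
    # Longer terms first, so they are tried before shorter ones
    # at the same position in the string
    terms.sort(key=len, reverse=True)
    # \b on both sides restricts matches to whole words
    pattern = r'\b(' + '|'.join(re.escape(t) for t in terms) + r')\b'
    # First match per row goes into the new column (NaN if none)
    df3[col] = df3['A'].str.extract(pattern, flags=re.IGNORECASE,
                                    expand=False)
    # Remove that first match from A
    df3['A'] = df3['A'].str.replace(pattern, '', n=1, regex=True,
                                    flags=re.IGNORECASE)

# Collapse the gaps left behind by the removed words
df3['A'] = df3['A'].str.replace(r'\s+', ' ', regex=True).str.strip()
print(df3)
```

On the sample data this leaves `Prada Large`, `Gucci` and `blue Prada` in A. The word `blue` stays in A because it is not listed in the `color` column of df2, and the extracted values keep the case they had in A.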

Solution 1
0 2017-03-16 18:34:54
