How to do a partial match check on strings in two different pandas columns?

Question

I have two dataframes, one with 300 names and one with 2000. I want to check whether all of the words in each of the 300 names appear in any of the 2000 names, in any order. For example:

Name 1: Mark, Alex, Smith

Name 2: Mark, Joseph, Smith, Alex, the, first

Dataframe 1

Name 1
'Mark', 'Alex', 'Smith'

Dataframe 2

Name 2
'Mark', 'Joseph', 'Alex', 'Smith', 'the', 'First'

As you can see, the column in dataframe 2 contains all of the words from the column in dataframe 1, plus some additional words.

My query should match here, because Name 2 contains all of the words from Name 1 even though it is not an exact match. Each name is already split into individual words in each cell.

Ideally, I would run a function across dataframe 2, which contains 2,000 names, and see whether any of those names contains all of the words from a name in dataframe 1.

Edit: Someone kindly pointed out in the comments that what I am really asking is whether I can find out if Name 1 is a subset of Name 2.

Answer 1

Assuming that each of your dataframes has a column whose cells are lists of strings:

>>> import pandas as pd
>>> df1 = pd.DataFrame({
...     "Name 1": [['Mark', 'Alex', 'Smith'], ['S1', 'S2', 'S3']],
... })
>>> df2 = pd.DataFrame({
...     "Name 2": [['Mark', 'Joseph', 'Alex', 'Smith', 'the', 'First'], ['S3', 'S4', 'S5']],
... })

You could merge both dataframes first:

>>> df = pd.merge(df1, df2, left_index=True, right_index=True)
>>> print(df)
                Name 1                                   Name 2
0  [Mark, Alex, Smith]  [Mark, Joseph, Alex, Smith, the, First]
1         [S1, S2, S3]                             [S3, S4, S5]

Now apply this function to each row to check whether the subset condition is True:

>>> is_subset = df.apply(lambda row: set(row["Name 1"]).issubset(row["Name 2"]), axis=1)
>>> is_subset
0     True
1    False
dtype: bool
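
Note that merging on the index only compares row i of one dataframe with row i of the other. The question asks whether each of the 300 names is contained in any of the 2000 names, so here is a minimal sketch of that variant. The helper name `matches_any` and the lower-casing of words are assumptions made for illustration (the question spells the same word as both 'first' and 'First'); drop the `.lower()` calls if case should matter.

```python
import pandas as pd

df1 = pd.DataFrame({"Name 1": [["Mark", "Alex", "Smith"], ["S1", "S2", "S3"]]})
df2 = pd.DataFrame({
    "Name 2": [
        ["Mark", "Joseph", "Alex", "Smith", "the", "First"],
        ["S3", "S4", "S5"],
    ],
})

# Build every candidate word set once, so it is not rebuilt for each lookup.
# Lower-casing is an assumption, not part of the original answer.
candidates = [frozenset(w.lower() for w in words) for words in df2["Name 2"]]


def matches_any(words):
    """Return True if all of `words` occur together in a single candidate name."""
    needed = {w.lower() for w in words}
    return any(needed <= candidate for candidate in candidates)


# One boolean per row of df1: does any row of df2 contain all of its words?
matched = df1["Name 1"].apply(matches_any)
print(matched.tolist())
```

With 300 and 2000 names this is at most 600,000 set comparisons, which should be quick in plain Python.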
