How to do a partial match check across strings in two different pandas columns?
I have two dataframes, one with 300 names and one with 2000. I want to check whether all of the words in each of the 300 names are contained in any of the 2000 names. For example:
Name 1: Mark, Alex, Smith
Name 2: Mark, Joseph, Smith, Alex, the, first
Dataframe 1
Name 1 |
---|
'Mark', 'Alex', 'Smith' |
Dataframe 2
Name 2 |
---|
'Mark', 'Joseph', 'Alex', 'Smith', 'the', 'First' |
As you can see, the column in dataframe 2 contains all of the words from the column in dataframe 1, plus additional words.
My query should match here, because Name 2 contains all of the words from Name 1 even though it is not an exact match. Each name is split into individual words in each cell.
Ideally, I would run a function across dataframe 2, which contains 2,000 names, and see whether any of those names contain all of the words from a name in dataframe 1.
Edit: Someone kindly pointed out in the comments that what I am trying to ask is whether I can find out if Name 1 is a subset of Name 2.
Assuming that each of your dataframes has a column containing lists of strings:
>>> import pandas as pd
>>> df1 = pd.DataFrame({
...     "Name 1": [['Mark', 'Alex', 'Smith'], ['S1', 'S2', 'S3']],
... })
>>> df2 = pd.DataFrame({
...     "Name 2": [['Mark', 'Joseph', 'Alex', 'Smith', 'the', 'First'], ['S3', 'S4', 'S5']],
... })
You could merge both dataframes first:
>>> df = pd.merge(df1, df2, left_index=True, right_index=True)
>>> print(df)
Name 1 Name 2
0 [Mark, Alex, Smith] [Mark, Joseph, Alex, Smith, the, First]
1 [S1, S2, S3] [S3, S4, S5]
And now apply this function to check whether the subset condition is True for each row:
>>> df = df.apply(lambda row: set(row["Name 1"]).issubset(set(row["Name 2"])), axis=1)
>>> df
0 True
1 False
dtype: bool
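Note that the index-based merge above only compares names that sit on the same row. If the goal is to check each Name 1 against every Name 2, as the question seems to ask, one possible sketch is shown below. The helper `matches_any` and the column name `has_match` are made up for illustration, and the data is the same toy data as above, not the real 300 and 2000 names.

```python
import pandas as pd

df1 = pd.DataFrame({
    "Name 1": [["Mark", "Alex", "Smith"], ["S1", "S2", "S3"]],
})
df2 = pd.DataFrame({
    "Name 2": [["Mark", "Joseph", "Alex", "Smith", "the", "First"], ["S3", "S4", "S5"]],
})

# Convert every candidate name to a frozenset once, so each
# subset test does not rebuild the set.
candidates = [frozenset(words) for words in df2["Name 2"]]

def matches_any(words):
    """Hypothetical helper: True if all words appear together in at least one candidate name."""
    needed = set(words)
    return any(needed <= candidate for candidate in candidates)

# One boolean per row of df1: does any Name 2 contain all of its words?
df1["has_match"] = df1["Name 1"].apply(matches_any)
print(df1)
```

The comparison is case-sensitive, so 'first' and 'First' count as different words; lowercasing the words before building the sets is an option if that is not what you want.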