
Find number of keyword matches in pandas column that is in a list

I have a pandas dataframe that looks like the following:

Type        Keywords 
----        --------
Animal      [Pigeon, Bird, Raccoon, Dog, Cat]
Pet         [Dog, Cat, Hamster]
Pest        [Rat, Mouse, Raccoon, Pigeon]
Farm        [Chicken, Horse, Cow, Sheep]
Predator    [Wolf, Fox, Raccoon]
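
For anyone who wants to reproduce this, the frame above can be rebuilt roughly as follows. This is only a sketch: it assumes each Keywords cell holds a plain Python list of strings, which the question implies but does not state outright.

```python
import pandas as pd

# Rebuild the example frame; each Keywords cell is assumed to be a Python list.
df = pd.DataFrame({
    "Type": ["Animal", "Pet", "Pest", "Farm", "Predator"],
    "Keywords": [
        ["Pigeon", "Bird", "Raccoon", "Dog", "Cat"],
        ["Dog", "Cat", "Hamster"],
        ["Rat", "Mouse", "Raccoon", "Pigeon"],
        ["Chicken", "Horse", "Cow", "Sheep"],
        ["Wolf", "Fox", "Raccoon"],
    ],
})
print(df)
```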

Let's say that I have the following string:

input = "There is a dead rat and raccoon in my pool"

After I tokenize the string and remove stop-words, it becomes:

input = [Dead, Rat, Raccoon, Pool]
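
The tokenizing step is not shown in the question, so here is one way it might look. The stop-word set and the `tokenize` helper are made up for illustration (a real pipeline would probably use a library's stop-word list), and the result is stored in `tokens` rather than `input` to avoid shadowing the built-in.

```python
import re

# Hypothetical stop-word list, just large enough for this example.
STOP_WORDS = {"there", "is", "a", "and", "in", "my"}

def tokenize(text):
    """Split on non-letters, drop stop-words, and capitalize to match the Keywords casing."""
    words = re.findall(r"[A-Za-z]+", text.lower())
    return [w.capitalize() for w in words if w not in STOP_WORDS]

tokens = tokenize("There is a dead rat and raccoon in my pool")
print(tokens)  # ['Dead', 'Rat', 'Raccoon', 'Pool']
```

Capitalizing the tokens matters here because the keyword comparison is case-sensitive.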

I need to go through each row and find the rows that have the highest number of keyword matches. With the given example, the results would look like the following:

Type        Keywords                            Matches
----        --------                            -------
Animal      [Pigeon, Bird, Raccoon, Dog, Cat]   1
Pet         [Dog, Cat, Hamster]                 0
Pest        [Rat, Mouse, Raccoon, Pigeon]       2
Farm        [Chicken, Horse, Cow, Sheep]        0
Predator    [Wolf, Fox, Raccoon]                1

The output would be the top three Type names that have the highest number of matches.

In the above case, since the "Pest" category has the highest number of matches, it would be selected as the top match. The Animal and Predator categories would also be selected. The output, in order, would thus be:

output = [Pest, Animal, Predator]

Doing this task with nested for loops is easy, but since I have thousands of these kinds of rows, I'm looking for a better solution. (Additionally, for some reason I have encountered a lot of bugs when using non-built-in functions with pandas; perhaps it's because of vectorization?)

I looked at the groupby and isin functions that are built into pandas, but as far as I could tell they would not get me to the output that I want (I would not be surprised at all if I am incorrect in this assumption).

I next investigated using sets and hashmaps with pandas, but unfortunately my coding ability is not yet good enough to craft a solid solution. This StackOverflow link in particular got me much closer to what I wanted, though it didn't find the top three matching row names.

I would greatly appreciate any help or advice.

You can use isin:

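# s holds the input tokens, e.g. ['Dead', 'Rat', 'Raccoon', 'Pool']
# Spread each keyword list across columns, test membership, count matches per row
df['Matches'] = pd.DataFrame(df.Keywords.values.tolist(), index=df.index).isin(s).sum(axis=1)

# Type names of the rows with at least one match
df.loc[df['Matches'] > 0, 'Type'].values.tolist()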
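
A self-contained version of the isin idea, extended to rank the rows, might look like the sketch below. The `sort_values` and `head(3)` steps are my addition and not part of the answer above, and the names `wide`, `matched` and `top_types` are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "Type": ["Animal", "Pet", "Pest", "Farm", "Predator"],
    "Keywords": [
        ["Pigeon", "Bird", "Raccoon", "Dog", "Cat"],
        ["Dog", "Cat", "Hamster"],
        ["Rat", "Mouse", "Raccoon", "Pigeon"],
        ["Chicken", "Horse", "Cow", "Sheep"],
        ["Wolf", "Fox", "Raccoon"],
    ],
})
s = ["Dead", "Rat", "Raccoon", "Pool"]

# Expand each list into one cell per keyword (shorter lists are padded with None),
# test every cell for membership in s, then count the True values per row.
wide = pd.DataFrame(df["Keywords"].tolist(), index=df.index)
df["Matches"] = wide.isin(s).sum(axis=1)

# Keep rows with at least one match, highest counts first.
# mergesort is a stable sort, so tied rows stay in their original order.
matched = df[df["Matches"] > 0].sort_values("Matches", ascending=False, kind="mergesort")
top_types = matched["Type"].head(3).tolist()
print(top_types)
```

Passing `index=df.index` keeps the counts aligned with the original rows even when the frame does not use the default 0..n-1 index.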

Storing and operating on lists in a DataFrame is not very efficient. That said, we can use set intersection here:

Setup

s = set(['Dead', 'Rat', 'Raccoon', 'Pool'])

Now using a list comprehension (faster than apply):

out = df.assign(Matches=[len(set(el) & s) for el in df.Keywords])


       Type                           Keywords  Matches
0    Animal  [Pigeon, Bird, Raccoon, Dog, Cat]        1
1       Pet                [Dog, Cat, Hamster]        0
2      Pest      [Rat, Mouse, Raccoon, Pigeon]        2
3      Farm       [Chicken, Horse, Cow, Sheep]        0
4  Predator               [Wolf, Fox, Raccoon]        1

To find the three rows with the most matches:

out.loc[out.Matches.nlargest(3).index].Type.tolist()

['Pest', 'Animal', 'Predator']
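
Putting the set-intersection steps together, a reusable helper might look like this. It is a sketch under a few assumptions: `top_types` is a name I made up, matching is case-sensitive, and rows with zero matches are filtered out first so that `nlargest` cannot pad the result with them when fewer than three rows match.

```python
import pandas as pd

def top_types(df, tokens, n=3):
    """Return up to n Type values ranked by how many Keywords appear in tokens."""
    wanted = set(tokens)
    matches = pd.Series(
        [len(set(kw) & wanted) for kw in df["Keywords"]],
        index=df.index,
    )
    # Drop zero-match rows, then take the n largest counts.
    best = matches[matches > 0].nlargest(n)
    return df.loc[best.index, "Type"].tolist()

df = pd.DataFrame({
    "Type": ["Animal", "Pet", "Pest", "Farm", "Predator"],
    "Keywords": [
        ["Pigeon", "Bird", "Raccoon", "Dog", "Cat"],
        ["Dog", "Cat", "Hamster"],
        ["Rat", "Mouse", "Raccoon", "Pigeon"],
        ["Chicken", "Horse", "Cow", "Sheep"],
        ["Wolf", "Fox", "Raccoon"],
    ],
})

print(top_types(df, ["Dead", "Rat", "Raccoon", "Pool"]))  # ['Pest', 'Animal', 'Predator']
print(top_types(df, ["Cow"]))                             # ['Farm']
```

With the default `keep='first'`, `nlargest` breaks ties by row order, which is why Animal comes before Predator.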
