Find number of keyword matches in pandas column that is in a list
I have a pandas dataframe that looks like the following:
Type Keywords
---- --------
Animal [Pigeon, Bird, Raccoon, Dog, Cat]
Pet [Dog, Cat, Hamster]
Pest [Rat, Mouse, Raccoon, Pigeon]
Farm [Chicken, Horse, Cow, Sheep]
Predator [Wolf, Fox, Raccoon]
Let's say that I have the following string:
input = "There is a dead rat and raccoon in my pool"
Suppose I tokenize the string and remove stop-words so that it becomes:
input = [Dead, Rat, Raccoon, Pool]
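The tokenization step itself is not shown in the question. A minimal sketch of what it might look like, assuming a tiny hand-written stop-word set (`STOP_WORDS` and `tokenize` are hypothetical names for illustration, not part of the original post) and capitalized tokens so they match the casing used in the Keywords column:

```python
import re

# Hypothetical stop-word set for illustration only; a real list would be much larger.
STOP_WORDS = {"there", "is", "a", "and", "in", "my"}

def tokenize(text):
    """Split text into alphabetic words, drop stop-words, capitalize each token."""
    words = re.findall(r"[A-Za-z]+", text.lower())
    return [w.capitalize() for w in words if w not in STOP_WORDS]

tokens = tokenize("There is a dead rat and raccoon in my pool")
print(tokens)
```

The name `tokens` is used here instead of `input` to avoid shadowing Python's built-in `input` function.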
I need to go through each row and find the rows that have the highest number of keyword matches. With the given example, the results would look like the following:
Type Keywords Matches
---- -------- -------
Animal [Pigeon, Bird, Raccoon, Dog, Cat] 1
Pet [Dog, Cat, Hamster] 0
Pest [Rat, Mouse, Raccoon, Pigeon] 2
Farm [Chicken, Horse, Cow, Sheep] 0
Predator [Wolf, Fox, Raccoon] 1
The output would be the top three Type names that have the highest number of matches. In the above case, since the "Pest" category has the highest number of matches, it would be selected as the highest match. Additionally, both the Animal and Predator categories would be selected. The output in order would thus be:
output = [Pest, Animal, Predator]
Doing this task with nested for loops is easy, but since I have thousands of these kinds of rows, I'm looking for a better solution. (Additionally, for some reason I have encountered a lot of bugs when using non-built-in functions with pandas; perhaps it's because of vectorization?)
I looked at the groupby and isin functions that are built into pandas, but as far as I could tell they would not be able to get me to the output that I want (I would not be surprised at all if I am incorrect in this assumption).
I next investigated the usage of sets and hashmaps with pandas, but unfortunately my coding knowledge and current ability are not yet sufficient to craft a solid solution. This StackOverflow link in particular got me much closer to what I wanted, though it didn't find the top three matching row names.
I would greatly appreciate any help or advice.
You may check isin
df['Matches']=pd.DataFrame(df.Keywords.values.tolist()).isin(s).sum(1)
df.loc[df['Matches']>0,'Type'].values.tolist()
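The snippet above relies on `df` and `s` already being defined. A self-contained sketch of the same isin idea, assuming the example data from the question and that `s` holds the tokenized input (the intermediate name `wide` is introduced here for illustration):

```python
import pandas as pd

# Example data from the question.
df = pd.DataFrame({
    "Type": ["Animal", "Pet", "Pest", "Farm", "Predator"],
    "Keywords": [
        ["Pigeon", "Bird", "Raccoon", "Dog", "Cat"],
        ["Dog", "Cat", "Hamster"],
        ["Rat", "Mouse", "Raccoon", "Pigeon"],
        ["Chicken", "Horse", "Cow", "Sheep"],
        ["Wolf", "Fox", "Raccoon"],
    ],
})
s = ["Dead", "Rat", "Raccoon", "Pool"]  # assumed: the tokenized input

# Expand each keyword list into its own row of a wide frame; shorter lists are padded.
wide = pd.DataFrame(df["Keywords"].tolist(), index=df.index)

# Mark cells that appear in s, then count the hits per row.
df["Matches"] = wide.isin(s).sum(axis=1)

# Types with at least one match, in original row order.
matched_types = df.loc[df["Matches"] > 0, "Type"].tolist()
print(matched_types)
```

Note that this selects every row with at least one match, in the original row order, rather than the top three by count. Also, a keyword repeated within one row would be counted once per occurrence here.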
Storing and operating on lists in a DataFrame is not very efficient. That said, we can use set intersection here:
Setup
s = set(['Dead', 'Rat', 'Raccoon', 'Pool'])
Now using a list comprehension (faster than apply):
out = df.assign(Matches=[len(set(el) & s) for el in df.Keywords])
Type Keywords Matches
0 Animal [Pigeon, Bird, Raccoon, Dog, Cat] 1
1 Pet [Dog, Cat, Hamster] 0
2 Pest [Rat, Mouse, Raccoon, Pigeon] 2
3 Farm [Chicken, Horse, Cow, Sheep] 0
4 Predator [Wolf, Fox, Raccoon] 1
To find the three rows with the most matches:
out.loc[out.Matches.nlargest(3).index].Type.tolist()
['Pest', 'Animal', 'Predator']
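Putting the pieces together, here is one possible end-to-end sketch using the example data from the question. The helper name `top_types` and its `n` parameter are assumptions made for illustration. Unlike the one-liner above, this version drops rows with zero matches before ranking, so fewer than `n` names may be returned:

```python
import pandas as pd

def top_types(df, tokens, n=3):
    """Return up to n Type names ranked by keyword overlap with tokens."""
    token_set = set(tokens)
    # Count the overlap between each row's keyword list and the token set.
    matches = pd.Series(
        [len(set(keywords) & token_set) for keywords in df["Keywords"]],
        index=df.index,
    )
    # Drop rows with no overlap so they can never appear in the result.
    matches = matches[matches > 0]
    # nlargest keeps the first occurrence among ties by default.
    return df.loc[matches.nlargest(n).index, "Type"].tolist()

df = pd.DataFrame({
    "Type": ["Animal", "Pet", "Pest", "Farm", "Predator"],
    "Keywords": [
        ["Pigeon", "Bird", "Raccoon", "Dog", "Cat"],
        ["Dog", "Cat", "Hamster"],
        ["Rat", "Mouse", "Raccoon", "Pigeon"],
        ["Chicken", "Horse", "Cow", "Sheep"],
        ["Wolf", "Fox", "Raccoon"],
    ],
})

result = top_types(df, ["Dead", "Rat", "Raccoon", "Pool"])
print(result)
```

If the keyword lists are reused across many queries, converting them to sets once up front would avoid rebuilding them on every call.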