
Pandas: search list of keywords in the text column and tag it

I have a bag of words stored as elements of a list. I want to find the rows of a pandas DataFrame whose text starts with any of these words; that is, a row should match ONLY if its text 'startswith' an element of the list. I have tried both 'startswith' and 'contains' for the comparison.

Code:

import pandas as pd
# list of words to search for
searchwords = ['harry','harry potter','secret garden']

# Data
l1 = [1, 2, 3,4,5]
l2 = ['Harry Potter is a great book',
      'Harry Potter is very famous',
      'I enjoyed reading Harry Potter series',
      'LOTR is also a great book along',
      'Have you read Secret Garden as well?'
]
df = pd.DataFrame({'id':l1,'text':l2})
df['text'] = df['text'].str.lower()

# Preview df:
    id  text
0   1   harry potter is a great book
1   2   harry potter is very famous
2   3   i enjoyed reading harry potter series
3   4   lotr is also a great book along
4   5   have you read secret garden as well?

Try #1:

When I run this command, it matches the keywords anywhere in the text column, which is not what I am looking for. I only ran it as a sanity check, to confirm that the basic approach works.
df[df['text'].str.contains('|'.join(searchwords))]

Try #2: When I run this command, it returns nothing. Why is that? Am I doing something wrong? Searching for the single string 'harry' works, but passing in the joined list of elements does not.

df[df['text'].str.startswith('harry')] # works with single string.
df[df['text'].str.startswith('|'.join(searchwords))] # returns nothing! 

Use startswith with a tuple.

Ex:

import pandas as pd

searchwords = ['harry','harry potter','secret garden']

# Data
l1 = [1, 2, 3,4,5]
l2 = ['Harry Potter is a great book',
      'Harry Potter is very famous',
      'I enjoyed reading Harry Potter series',
      'LOTR is also a great book along',
      'Have you read Secret Garden as well?'
]
df = pd.DataFrame({'id':l1,'text':l2})
df['text'] = df['text'].str.lower()

print(df[df['text'].str.startswith(tuple(searchwords))] )

Output:

   id                          text
0   1  harry potter is a great book
1   2   harry potter is very famous
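
To illustrate why the joined string in the question matches nothing while the tuple works, here is a minimal sketch. It assumes a pandas version whose Series.str.startswith accepts a tuple, as the answer above relies on; the two-row sample Series and the variable names are made up for the illustration.

```python
import pandas as pd

searchwords = ['harry', 'harry potter', 'secret garden']

# Small sample, already lower-cased (illustrative data only)
text = pd.Series(['harry potter is a great book',
                  'have you read secret garden as well?'])

# startswith does no regex parsing, so this looks for the literal
# prefix 'harry|harry potter|secret garden', which no row has
joined = '|'.join(searchwords)
literal_mask = text.str.startswith(joined)

# A tuple means "starts with any of these prefixes"
tuple_mask = text.str.startswith(tuple(searchwords))

print(literal_mask.tolist())
print(tuple_mask.tolist())
```

The second row contains 'secret garden' but does not start with it, so only the first row matches the tuple version.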

Since startswith treats its argument as a literal string, not a regex, use str.findall with a pattern anchored at the start of the string:

df[df['text'].str.findall('^(?:'+'|'.join(searchwords) + ')').apply(len) > 0]

Output:

   id                          text
0   1  harry potter is a great book
1   2   harry potter is very famous
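
A possible variant of the regex approach, sketched under two assumptions that are not in the original answer: the keywords should be matched literally (hence re.escape), and a boolean mask is enough (hence str.match, which only matches at the start of each string, instead of findall plus a length check).

```python
import re
import pandas as pd

searchwords = ['harry', 'harry potter', 'secret garden']

df = pd.DataFrame({
    'id': [1, 2, 3, 4, 5],
    'text': ['Harry Potter is a great book',
             'Harry Potter is very famous',
             'I enjoyed reading Harry Potter series',
             'LOTR is also a great book along',
             'Have you read Secret Garden as well?'],
})
df['text'] = df['text'].str.lower()

# Escape each keyword so characters such as '.' or '+' are taken literally
pattern = '(?:' + '|'.join(re.escape(w) for w in searchwords) + ')'

# str.match anchors at the beginning of each string, so no '^' is needed
mask = df['text'].str.match(pattern)
result = df[mask]
print(result)
```

With the sample data this selects the same two rows (id 1 and 2) as the findall version.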

You can pass a tuple to the startswith function to check for multiple prefixes at once. See the related question "str.startswith with a list of strings to test for".

In your case, you can do:

df['text'].str.startswith(tuple(searchwords))

Out:
0     True
1     True
2    False
3    False
4    False
Name: text, dtype: bool
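
If "tag it" in the title means storing the result in the DataFrame, one way this might look is sketched below. The column names 'starts_with_keyword' and 'tag' are hypothetical, and using str.extract to record which keyword matched is an addition for illustration, not part of the answers above.

```python
import re
import pandas as pd

searchwords = ['harry', 'harry potter', 'secret garden']

df = pd.DataFrame({
    'id': [1, 2, 3, 4, 5],
    'text': ['Harry Potter is a great book',
             'Harry Potter is very famous',
             'I enjoyed reading Harry Potter series',
             'LOTR is also a great book along',
             'Have you read Secret Garden as well?'],
})
df['text'] = df['text'].str.lower()

# Boolean tag: does the text start with any keyword?
df['starts_with_keyword'] = df['text'].str.startswith(tuple(searchwords))

# Keyword tag: try longer keywords first so that 'harry potter'
# is preferred over its own prefix 'harry'
ordered = sorted(searchwords, key=len, reverse=True)
pattern = '^(' + '|'.join(re.escape(w) for w in ordered) + ')'
df['tag'] = df['text'].str.extract(pattern, expand=False)

print(df[['id', 'starts_with_keyword', 'tag']])
```

Rows whose text does not start with a keyword get a missing value in 'tag', so the two columns stay consistent with each other.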
