
Check if a column in a data frame contains any word from a list, and add a count (Python)

Here is the input dataframe:

df_data = pd.DataFrame({'A':[2,1,3], 'content': ['the dog is sleeping', 'my name is Dude', 'i am who i am']})

and the list of words:

words_list= ['dog', 'Dude','sleeping', 'i']

Now, I know how to create a new column indicating whether a row contains any of the words I want, something like this:

df_data['new'] = df_data.apply(lambda row: True if any([item in row['content'] for item in words_list]) else False, axis = 1)

The point is that I also want a count of the words. For example, in row number 2 and row number 3 I have 2 words from my list, so I want a new column with the value 2, and so on.

Thank you!

Try pandas.Series.str.findall to extract the matches:

import pandas as pd
import re

df_data = pd.DataFrame({'A':[2,1,3], 'content': ['the dog is sleeping', 'my name is Dude', 'i am who i am']})
words_list= ['dog', 'Dude','sleeping', 'i']

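# Build one pattern that matches any listed word as a whole word (\b is a word boundary)
search_ = re.compile("\\b%s\\b" % "\\b|\\b".join(words_list))

# Collect every match per row, then count the matches
df_data['matches'] = df_data.content.str.findall(search_)
df_data['count'] = df_data['matches'].apply(len)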

  A              content          matches  count
0  2  the dog is sleeping  [dog, sleeping]      2
1  1      my name is Dude           [Dude]      1
2  3        i am who i am           [i, i]      2
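
If only the count is needed, the intermediate list column can be skipped. The following is a minimal sketch, not part of the original answer: it assumes the same df_data and words_list, uses pandas.Series.str.count, and escapes each word with re.escape in case a word contains regex metacharacters. The column name has_word is made up for illustration.

```python
import re
import pandas as pd

df_data = pd.DataFrame({
    'A': [2, 1, 3],
    'content': ['the dog is sleeping', 'my name is Dude', 'i am who i am'],
})
words_list = ['dog', 'Dude', 'sleeping', 'i']

# Escape each word so regex metacharacters are matched literally,
# then wrap the whole alternation in a single pair of word boundaries.
pattern = r"\b(?:" + "|".join(re.escape(w) for w in words_list) + r")\b"

# str.count returns the number of non-overlapping matches in each row.
df_data['count'] = df_data['content'].str.count(pattern)

# Hypothetical boolean column: True when at least one word matched.
df_data['has_word'] = df_data['count'] > 0
```

With the sample data this gives counts of 2, 1 and 2, the same as the findall approach.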

First, I think you need to modify your initial function, as it may produce incorrect output.

For example:

words_list= ['do']
df_data['new'] = df_data.apply(lambda row: True if any([item in row['content'] for item in words_list]) else False, axis = 1)

Results in

   A              content    new
0  2  the dog is sleeping   True
1  1      my name is Dude  False
2  3        i am who i am  False

Though there is no word 'do' in the first row. This can be fixed by splitting the row content into a list of words:

row['content'].split()
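
The difference is easy to see on a single string. A small sketch, using the first row's text (the variable names are illustrative only):

```python
text = 'the dog is sleeping'

# Substring test: 'do' is found inside 'dog'.
substring_hit = 'do' in text

# Token test: 'do' is not one of the whitespace-separated words.
token_hit = 'do' in text.split()

print(substring_hit, token_hit)  # True False
```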

The count can then be computed by applying sum to the list of booleans:

df_data['new'] = df_data.apply(lambda row: sum([item in row['content'].split() for item in words_list]), axis = 1)

Output:

   A              content  new
0  2  the dog is sleeping    2
1  1      my name is Dude    1
2  3        i am who i am    1
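
Note that the last row gets 1 here, because this approach counts how many distinct words from the list appear in the row, not how many times they occur ('i' appears twice). Which one is wanted depends on the use case. A hedged sketch of both variants, assuming whitespace tokenization is good enough; the column names distinct and occurrences are made up for illustration:

```python
import pandas as pd

df_data = pd.DataFrame({
    'A': [2, 1, 3],
    'content': ['the dog is sleeping', 'my name is Dude', 'i am who i am'],
})
words_list = ['dog', 'Dude', 'sleeping', 'i']
word_set = set(words_list)  # set membership tests are fast

# Distinct list words present in each row: size of the set intersection.
df_data['distinct'] = df_data['content'].apply(
    lambda s: len(word_set & set(s.split()))
)

# Total occurrences of list words in each row, counting repeats.
df_data['occurrences'] = df_data['content'].apply(
    lambda s: sum(tok in word_set for tok in s.split())
)
```

For the sample data, distinct is 2, 1, 1 and occurrences is 2, 1, 2.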


 