
Counting a list of words in a list of strings using python

So I have a pandas dataframe with rows of tokenized strings in a column named story. I also have a list of words in a list called selected_words. I am trying to count the instances of any of the selected_words in each of the rows in the column story.

The code I used before, which had worked, is:

CCwordsCount=df4.story.str.count('|'.join(selected_words))

This is now giving me NaN values for every row.

Below are the first few rows of the column story in df4. The dataframe contains a little over 400 rows of NYTimes articles.

0      [it, was, a, curious, choice, for, the, good, ...
1      [when, he, was, a, yale, law, school, student,...
2      [video, bitcoin, has, real, world, investors, ...
3      [bitcoin, s, wild, ride, may, not, have, been,...
4      [amid, the, incense, cheap, art, and, herbal, ...
5      [san, francisco, eight, years, ago, ernie, all...

This is the list of selected_words

selected_words = ['accept', 'believe', 'trust', 'accepted', 'accepts', 'trusts', 'believes', \
                  'acceptance', 'trusted', 'trusting', 'accepting', 'believes', 'believing', 'believed',\
                 'normal', 'normalize', ' normalized', 'routine', 'belief', 'faith', 'confidence', 'adoption', \
                  'adopt', 'adopted', 'embrace', 'approve', 'approval', 'approved', 'approves']

Link to my df4.csv file

The .find() function (or the in operator, as used below) can be useful here, and this can be implemented in many different ways. If you have no other use for the raw articles and they can be treated as one long string, try the following. You could also put them in a dictionary and loop over it.

def find_words(text, words):
    return [word for word in words if word in text]

sentences = "0  [it, was, a, curious, choice, for, the, good, 1      [when, he, was, a, yale, law, school, student, 2      [video, bitcoin, has, real, world, investors, 3      [bitcoin, s, wild, ride, may, not, have, been, 4      [amid, the, incense, cheap, art, and, herbal, 5      [san, francisco, eight, years, ago, ernie, all"

search_keywords=['accept', 'believe', 'trust', 'accepted', 'accepts', 'trusts', 'believes', \
                  'acceptance', 'trusted', 'trusting', 'accepting', 'believes', 'believing', 'believed',\
                 'normal', 'normalize', ' normalized', 'routine', 'belief', 'faith', 'confidence', 'adoption', \
                  'adopt', 'adopted', 'embrace', 'approve', 'approval', 'approved', 'approves', 'good']

found = find_words(sentences, search_keywords)

print(found)

Note: I didn't have a pandas DataFrame in mind when I wrote this.
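
Note that find_words reports which keywords occur somewhere in the text as substrings, not how many times they occur. If per-word counts are needed and the tokens are available as a real Python list, a minimal sketch could look like the following. The function name count_words and the sample tokens are made up for illustration and are not from the original post.

```python
from collections import Counter

def count_words(tokens, words):
    """Return a Counter of how often each selected word appears in tokens."""
    wanted = set(words)  # set lookup keeps the membership test cheap
    return Counter(tok for tok in tokens if tok in wanted)

# Hypothetical tokenized story, not taken from df4.
tokens = ["bitcoin", "investors", "trust", "it", "and", "trust",
          "grows", "with", "adoption"]

counts = count_words(tokens, ["trust", "adoption", "faith"])
print(counts)                # per-word counts for the words that were found
print(sum(counts.values()))  # total number of matching tokens
```

Because this compares whole tokens, a keyword such as 'normal' would not be counted inside a longer token such as 'abnormal'.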

Each story entry appears to be a list that was converted to a string.

Use map to strip the surrounding brackets from each entry before applying the .str accessor, as follows.

CCwordsCount = df4.story.map(lambda x: ''.join(x[1:-1])).str.count('|'.join(selected_words))

print(CCwordsCount.head(20))   # Show first 20 story results

Output

0      1
1      2
2      5
3      7
4      0
5      1
6     10
7      8
8      2
9      2
10     8
11     0
12     0
13     2
14     0
15     4
16     2
17     9
18     0
19     0
Name: story, dtype: int64
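
One thing to keep in mind when reading these numbers: str.count treats its argument as a regular expression, so a pattern built with '|'.join(...) also matches keywords that appear inside longer words. The sketch below uses only the standard re module to show the difference on a single string. The shortened word list and the sample text are invented for illustration.

```python
import re

selected_words = ["accept", "trust", "normal"]  # shortened, illustrative list
text = "'they', 'accepted', 'the', 'abnormal', 'trust', 'deal'"

# Plain alternation: matches anywhere, including inside longer words.
substring_pattern = "|".join(map(re.escape, selected_words))
substring_count = len(re.findall(substring_pattern, text))

# Same alternation wrapped in word boundaries: only whole words match.
whole_word_pattern = r"\b(?:" + substring_pattern + r")\b"
whole_word_count = len(re.findall(whole_word_pattern, text))

print(substring_count, whole_word_count)  # prints: 3 1
```

If whole-word counts are what is wanted, the word-boundary version of the pattern could be passed to str.count in place of the plain join.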

Explanation

Each story is a list that was converted to a string, so each entry basically looks like:

"['it', 'was', 'a', 'curious', 'choice', 'for', 'the', 'good', 'wife', ...]"

The leading '[' and trailing ']' are dropped with x[1:-1]. Because x is already a string, ''.join(...) simply puts the same characters back together:

map(lambda x: ''.join(x[1:-1]))

The result is the quoted words separated by commas. For the first row, this gives the string:

'it', 'was', 'a', 'curious', 'choice', 'for', ...
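
As an alternative sketch, assuming the entries really are stringified lists like the one above, the string can be parsed back into a list with ast.literal_eval and the tokens compared exactly. The helper name count_selected and the sample cell are hypothetical and not part of the original answer.

```python
import ast

def count_selected(cell, words):
    """Count tokens in cell that exactly equal one of the selected words.

    cell may be a real list of tokens or a stringified list such as
    "['it', 'was']"; strings are parsed with ast.literal_eval first.
    """
    tokens = ast.literal_eval(cell) if isinstance(cell, str) else cell
    wanted = set(words)
    return sum(1 for tok in tokens if tok in wanted)

# Hypothetical cell value, shaped like the stringified lists described above.
cell = "['they', 'accepted', 'the', 'abnormal', 'trust', 'deal']"
print(count_selected(cell, ["accept", "trust", "normal"]))  # prints: 1
```

With pandas, a function like this could be applied per row, for example df4.story.map(lambda cell: count_selected(cell, selected_words)). Unlike the regex count, it only counts exact token matches, so the totals may differ from the output shown earlier.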
