I have a pandas DataFrame with rows of tokenized strings in a column named story, and a list of words called selected_words. I am trying to count the occurrences of any of the selected_words in each row of the story column.
The code I used before, which had worked, is
CCwordsCount=df4.story.str.count('|'.join(selected_words))
This is now giving me NaN values for every row.
Below are the first few rows of the story column in df4. The DataFrame contains a little over 400 rows of NYTimes articles.
0 [it, was, a, curious, choice, for, the, good, ...
1 [when, he, was, a, yale, law, school, student,...
2 [video, bitcoin, has, real, world, investors, ...
3 [bitcoin, s, wild, ride, may, not, have, been,...
4 [amid, the, incense, cheap, art, and, herbal, ...
5 [san, francisco, eight, years, ago, ernie, all...
This is the list of selected_words
selected_words = ['accept', 'believe', 'trust', 'accepted', 'accepts', 'trusts', 'believes', \
'acceptance', 'trusted', 'trusting', 'accepting', 'believes', 'believing', 'believed',\
'normal', 'normalize', ' normalized', 'routine', 'belief', 'faith', 'confidence', 'adoption', \
'adopt', 'adopted', 'embrace', 'approve', 'approval', 'approved', 'approves']
The .find() function can be useful here (the example below uses the equivalent in operator), and this can be implemented in many different ways. If you have no other use for the raw articles and they can be treated as one long string, try the following. You could also put them in a dictionary and loop over it.
def find_words(text, words):
    # Return every word that occurs as a substring of text
    return [word for word in words if word in text]
sentences = "0 [it, was, a, curious, choice, for, the, good, 1 [when, he, was, a, yale, law, school, student, 2 [video, bitcoin, has, real, world, investors, 3 [bitcoin, s, wild, ride, may, not, have, been, 4 [amid, the, incense, cheap, art, and, herbal, 5 [san, francisco, eight, years, ago, ernie, all"
search_keywords=['accept', 'believe', 'trust', 'accepted', 'accepts', 'trusts', 'believes', \
'acceptance', 'trusted', 'trusting', 'accepting', 'believes', 'believing', 'believed',\
'normal', 'normalize', ' normalized', 'routine', 'belief', 'faith', 'confidence', 'adoption', \
'adopt', 'adopted', 'embrace', 'approve', 'approval', 'approved', 'approves', 'good']
found = find_words(sentences, search_keywords)
print(found)
Note: I didn't have a pandas DataFrame in mind when I wrote this.
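Note that find_words reports which keywords appear as substrings, so it returns no counts and 'trust' also matches inside 'trusted'. If the articles are available as lists of tokens, a minimal sketch of counting exact token matches could look like this. The name count_words and the sample tokens are made up for illustration, and stripping stray whitespace from the keywords is an assumption:

```python
from collections import Counter


def count_words(tokens, words):
    """Count how many tokens are exactly equal to one of the target words."""
    # Assumption: keywords may carry stray spaces (e.g. ' normalized'), so strip them.
    targets = {w.strip() for w in words}
    tally = Counter(t for t in tokens if t in targets)
    return sum(tally.values()), dict(tally)


# Hypothetical token list standing in for one tokenized article.
tokens = ["it", "was", "a", "good", "choice", "to", "trust",
          "the", "trusted", "good", "routine"]
keywords = ["trust", "trusted", "routine", "good", "accept"]

total, per_word = count_words(tokens, keywords)
print(total)     # 5
print(per_word)  # {'good': 2, 'trust': 1, 'trusted': 1, 'routine': 1}
```

Using a set for the keywords keeps each lookup cheap, which matters little for 400 articles but costs nothing.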
Each story entry appears to be a list that was converted to a string.
Use map to strip the surrounding brackets from that string before applying str.count, as follows.
CCwordsCount = df4.story.map(lambda x: ''.join(x[1:-1])).str.count('|'.join(selected_words))
print(CCwordsCount.head(20)) # Show first 20 story results
Output
0 1
1 2
2 5
3 7
4 0
5 1
6 10
7 8
8 2
9 2
10 8
11 0
12 0
13 2
14 0
15 4
16 2
17 9
18 0
19 0
Name: story, dtype: int64
Explanation
Each story was a list that had been converted to a string, so basically:
"['it', 'was', 'a', 'curious', 'choice', 'for', 'the', 'good', 'wife', ...]"
The map call drops the leading '[' and trailing ']' from that string and joins what remains back together:
map(lambda x: ''.join(x[1:-1]))
This leaves the words in quotes, separated by commas. For the first row this results in the string:
'it', 'was', 'a', 'curious', 'choice', 'for', ...
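If it is unclear whether the column holds real lists or stringified lists, one option is to normalize every cell to a plain space-separated string first and then count. The following is only a sketch under assumptions: the sample frame stands in for df4, to_tokens is a hypothetical helper, and the \b word boundaries are an addition so that 'trust' is not also counted inside 'trusted' (the original pattern counts substrings).

```python
import ast
import re

import pandas as pd

# Hypothetical sample data standing in for df4; the real frame is not shown in full.
df = pd.DataFrame({
    "story": [
        ["we", "trust", "the", "trusted", "routine"],          # a real list of tokens
        "['no', 'one', 'would', 'accept', 'the', 'belief']",   # a list stored as a string
        ["nothing", "relevant", "here"],
    ]
})
selected_words = ["accept", "trust", "trusted", "routine", "belief"]


def to_tokens(cell):
    """Return a list of tokens whether the cell is a list or a stringified list."""
    if isinstance(cell, str):
        return ast.literal_eval(cell)
    return list(cell)


# Join the tokens with spaces so that .str.count sees ordinary strings.
text = df["story"].map(lambda cell: " ".join(to_tokens(cell)))

# Word boundaries restrict matches to whole words; re.escape guards special characters.
pattern = r"\b(?:" + "|".join(re.escape(w.strip()) for w in selected_words) + r")\b"
counts = text.str.count(pattern)
print(counts.tolist())  # [3, 2, 0]
```

Running df["story"].map(type).value_counts() beforehand shows which cell types are actually present, which helps explain NaN results: .str methods return NaN for cells that are not strings.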