How to count word occurrences (from words in a specific list) and store the results in a new column in a Pandas DataFrame in Python?
I currently have a list of words about MMA.
I want to create a new column in my Pandas DataFrame called 'MMA Related Word Count'. I want to analyze the column 'Speech' for each row and sum up how often words (from the list below) occur within the speech. Does anyone know the best way to do this? I'd love to hear it, thanks in advance!
Please take a look at my DataFrame.
CODE EXAMPLE:
import pandas as pd
mma_related_words = ['mma', 'ju jitsu', 'boxing']
data = {
"Name": ['Dana White', 'Triple H'],
"Speech": ['mma is a fantastic sport. ju jitsu makes you better as a person.', 'Boxing sucks. Professional dancing is much better.']
}
#load data into a DataFrame object:
df = pd.DataFrame(data)
print(df)
CURRENT DATAFRAME:
Name | Speech
---|---
Dana White | mma is a fantastic sport. ju jitsu makes you better as a person.
Triple H | Boxing sucks. Professional dancing is much better.
EXPECTED OUTPUT: Exactly the same as above, but with a new column 'MMA Related Word Count' on the right side. For Dana White I want the value 2. For Triple H I want the value 1.
You can use a regex with str.count:
import re
regex = '|'.join(map(re.escape, mma_related_words))
# 'mma|ju\\ jitsu|boxing'
df['Word Count'] = df['Speech'].str.count(regex, flags=re.I)
# or
# df['Word Count'] = df['Speech'].str.count(r'(?i)'+regex)
Output:
Name Speech Word Count
0 Dana White mma is a fantastic sport. ju jitsu makes you b... 2
1 Triple H Boxing sucks. Professional dancing is much bet... 1
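One caveat with the pattern above: it matches substrings, so 'mma' would also be counted inside words such as 'comma' or 'dilemma'. If whole-word matches are what you want, a possible variation is to wrap the alternation in `\b` word boundaries. The following is only a sketch; the third row ('Hypothetical Speaker') and the column names 'Substring Count' and 'Whole Word Count' are made up for illustration and are not part of the question.

```python
import re
import pandas as pd

mma_related_words = ['mma', 'ju jitsu', 'boxing']

df = pd.DataFrame({
    "Name": ['Dana White', 'Triple H', 'Hypothetical Speaker'],
    "Speech": [
        'mma is a fantastic sport. ju jitsu makes you better as a person.',
        'Boxing sucks. Professional dancing is much better.',
        # Hypothetical row: 'comma' and 'dilemma' both contain 'mma'
        'A comma here, a dilemma there. MMA, mma!',
    ],
})

# Same alternation as in the answer above
plain = '|'.join(map(re.escape, mma_related_words))
# Variation: only match when the term is delimited by word boundaries
bounded = r'\b(?:' + plain + r')\b'

df['Substring Count'] = df['Speech'].str.count(plain, flags=re.I)
df['Whole Word Count'] = df['Speech'].str.count(bounded, flags=re.I)

print(df[['Name', 'Substring Count', 'Whole Word Count']])
```

For the two rows from the question both columns agree (2 and 1). They differ only for the hypothetical third row, where the substring pattern counts 4 and the bounded pattern counts 2.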
Using a simple loop inside a function passed to apply should work. Try this:
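def fun(string):
    # Count how many of the listed words appear at least once in the speech
    cnt = 0
    for w in mma_related_words:
        # Case-insensitive substring test
        if w.lower() in string.lower():
            cnt = cnt + 1
    return cnt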
df['MMA Related Word Count'] = df['Speech'].apply(lambda x: fun(string=x))
The same can also be written as:
df['MMA Related Word Count1'] = df['Speech'].apply(lambda x: sum([1 for w in mma_related_words if w.lower() in str(x).lower()]))
Output of df: both new columns contain 2 for Dana White and 1 for Triple H.
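Note that the `in` test only checks whether a word is present, so a word that is repeated in a speech is still counted once. If you need the number of occurrences instead (which is how I read "how often words occurred" in the question), a possible variation uses `str.count` on the lowercased text. This is a sketch on plain strings; the function names `count_presence` and `count_occurrences` and the sample sentence are hypothetical.

```python
mma_related_words = ['mma', 'ju jitsu', 'boxing']

def count_presence(speech):
    """Count how many list words appear at least once (behaviour of the loop above)."""
    text = str(speech).lower()
    return sum(1 for w in mma_related_words if w.lower() in text)

def count_occurrences(speech):
    """Count every non-overlapping occurrence of every list word."""
    text = str(speech).lower()
    return sum(text.count(w.lower()) for w in mma_related_words)

# Hypothetical speech in which 'mma' appears twice
repeated = 'MMA is great. I watch mma daily, and boxing too.'

print(count_presence(repeated))     # 2 ('mma' and 'boxing' are present)
print(count_occurrences(repeated))  # 3 ('mma' twice, 'boxing' once)
```

Either function can be passed to apply in the same way as above, for example `df['Speech'].apply(count_occurrences)`. For the two speeches in the question both give the same result, because no word is repeated there.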