How to count word occurrences (from a specific list of words) and store the results in a new column of a Pandas DataFrame in Python?

Question

I currently have a list of words about MMA.

I want to create a new column in my Pandas DataFrame called 'MMA Related Word Count'. I want to analyze the column 'Speech' for each row and sum up how often words (from the list below) occur within the speech. Does anyone know the best way to do this? I'd love to hear it, thanks in advance!

Please take a look at my dataframe.

CODE EXAMPLE:

import pandas as pd

mma_related_words = ['mma', 'ju jitsu', 'boxing']

data = {
  "Name": ['Dana White', 'Triple H'],
  "Speech": ['mma is a fantastic sport. ju jitsu makes you better as a person.', 'Boxing sucks. Professional dancing is much better.']
}

# Load the data into a DataFrame object
df = pd.DataFrame(data)

print(df)

CURRENT DATAFRAME:

Name	Speech
Dana White	mma is a fantastic sport. ju jitsu makes you better as a person.
Triple H	Boxing sucks. Professional dancing is much better.

EXPECTED OUTPUT: Exactly the same as above, but with a new column 'MMA Related Word Count' on the right. For Dana White the value should be 2; for Triple H I want the value 1.

Answer 1

You can use a regex with str.count:

import re
regex = '|'.join(map(re.escape, mma_related_words))
# 'mma|ju\\ jitsu|boxing'

df['Word Count'] = df['Speech'].str.count(regex, flags=re.I)
# or
# df['Word Count'] = df['Speech'].str.count(r'(?i)'+regex)

Output:

         Name                                             Speech  Word Count
0  Dana White  mma is a fantastic sport. ju jitsu makes you b...           2
1    Triple H  Boxing sucks. Professional dancing is much bet...           1
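
One caveat with a plain alternation is that it also matches inside longer words, for example "mma" inside "dilemma". The sketch below is not part of the original answer; it wraps the same pattern in `\b` word boundaries and compares the two counts. The sample speeches and the variable names `plain` and `bounded` are hypothetical, chosen only for illustration.

```python
import re
import pandas as pd

mma_related_words = ['mma', 'ju jitsu', 'boxing']

# Hypothetical sample speeches; "dilemma" contains "mma" as a substring.
speeches = pd.Series([
    'mma is a fantastic sport. ju jitsu makes you better as a person.',
    'Boxing sucks. What a dilemma.',
])

# Plain alternation: matches "mma" anywhere, including inside "dilemma".
plain = '|'.join(map(re.escape, mma_related_words))

# Same alternation wrapped in \b word boundaries, so only whole words match.
bounded = r'\b(?:' + plain + r')\b'

plain_counts = speeches.str.count(plain, flags=re.I)
bounded_counts = speeches.str.count(bounded, flags=re.I)

print(plain_counts.tolist())    # [2, 2]
print(bounded_counts.tolist())  # [2, 1]
```

Whether the boundary version is preferable depends on the word list: it avoids substring hits, but it would also stop "boxing" from matching inside a compound such as "kickboxing".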

Answer 2

A simple loop inside a function passed to apply should work. Note that it counts how many of the listed words appear at least once in each speech, not the total number of occurrences. Try this:

def fun(string):
    # Count how many of the listed words appear in the text (case-insensitive)
    text = str(string).lower()
    cnt = 0
    for w in mma_related_words:
        if w.lower() in text:
            cnt = cnt + 1
    return cnt

df['MMA Related Word Count'] = df['Speech'].apply(fun)

The same can also be written as a one-liner:

df['MMA Related Word Count1'] = df['Speech'].apply(lambda x: sum([1 for w in mma_related_words if w.lower() in str(x).lower()]))

Output of df: both new columns hold 2 for Dana White and 1 for Triple H.
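
If the total number of occurrences is wanted instead of the number of distinct listed words present, the membership test can be swapped for `str.count`. The following is a minimal sketch under that assumption, not the answerer's code; the sample speech and the helper names `count_present` and `count_occurrences` are hypothetical.

```python
import pandas as pd

mma_related_words = ['mma', 'ju jitsu', 'boxing']

# Hypothetical speech in which "mma" appears twice.
speeches = pd.Series(['MMA is great. I watch mma and boxing every week.'])

def count_present(text):
    """Number of listed words that appear at least once in the text."""
    lowered = str(text).lower()
    return sum(w.lower() in lowered for w in mma_related_words)

def count_occurrences(text):
    """Total number of non-overlapping occurrences of all listed words."""
    lowered = str(text).lower()
    return sum(lowered.count(w.lower()) for w in mma_related_words)

present = speeches.apply(count_present)
total = speeches.apply(count_occurrences)

print(present.tolist())  # [2]
print(total.tolist())    # [3]
```

Like the loop it is based on, this matches substrings, so a listed word embedded in a longer word is counted as well.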

Answer 1: 2, accepted, 2022-09-29 17:04:43
Answer 2: 0, 2022-09-29 17:01:13