
Conditional word frequency count in Pandas

I have a DataFrame like the one below:

import pandas as pd

data = {'speaker':['Adam','Ben','Clair'],
        'speech': ['Thank you very much and good afternoon.',
                   'Let me clarify that because I want to make sure we have got everything right',
                   'By now you should have some good rest']}
df = pd.DataFrame(data)

I want to count the words in the speech column, but only those that appear in a predefined list. For example, the list is:

wordlist = ['much', 'good','right']

I want to generate a new column that shows how many times these three words occur in each row. My expected output is therefore:

  speaker                                             speech  words
0    Adam            Thank you very much and good afternoon.      2
1     Ben  Let me clarify that because I want to make sur...      1
2   Clair              By now you should have some good rest      1

I tried:

df['total'] = 0
for word in df['speech'].str.split():
    if word in wordlist: 
        df['total'] += 1

But after running it, the total column is always zero. What is wrong with my code?

You could use the following vectorised approach:

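import re

import pandas as pd

data = {'speaker':['Adam','Ben','Clair'],
        'speech': ['Thank you very much and good afternoon.',
                   'Let me clarify that because I want to make sure we have got everything right',
                   'By now you should have some good rest']}
df = pd.DataFrame(data)

wordlist = ['much', 'good','right']

# Wrap the whole alternation in word boundaries so that every word,
# including the first and the last, only matches as a whole word.
pattern = r'\b(?:' + '|'.join(map(re.escape, wordlist)) + r')\b'
df['total'] = df['speech'].str.count(pattern)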

Which gives:

>>> df
  speaker                                             speech  total
0    Adam            Thank you very much and good afternoon.      2
1     Ben  Let me clarify that because I want to make sur...      1
2   Clair              By now you should have some good rest      1
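
As for why the loop in the question leaves total at zero: iterating over df['speech'].str.split() yields one list of tokens per row, not one word at a time, so the test `word in wordlist` compares a whole list against the strings in wordlist and is never true. Even if it were true, `df['total'] += 1` would add one to every row rather than to the current row. The sketch below shows this and gives a row-wise version that counts inside each row's own token list; the names first_tokens and allowed are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    "speaker": ["Adam", "Ben", "Clair"],
    "speech": [
        "Thank you very much and good afternoon.",
        "Let me clarify that because I want to make sure we have got everything right",
        "By now you should have some good rest",
    ],
})
wordlist = ["much", "good", "right"]

# Each element produced by the original loop is a list of tokens, so the
# membership test compares a list against strings and is always False.
first_tokens = df["speech"].str.split().iloc[0]
print(type(first_tokens).__name__, first_tokens in wordlist)  # list False

# Row-wise version: count the matching tokens inside each row's own list.
allowed = set(wordlist)  # set membership is O(1) on average
df["total"] = [
    sum(token in allowed for token in tokens)
    for tokens in df["speech"].str.split()
]
print(df["total"].tolist())  # [2, 1, 1]
```

Note that a plain split() keeps punctuation attached to words, so this version only counts tokens that match a listed word exactly.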

If you have a very large word list and a large DataFrame to search through, the solution below can be much faster at runtime.

My guess is that this is because it takes advantage of a dictionary, which takes O(N) to construct and O(1) on average per lookup. Performance-wise, a regex search is slower.

import pandas as pd
from collections import Counter

def occurrence_counter(target_string, search_list):
    # Count each whitespace-separated token once, then look up every listed word.
    data = Counter(target_string.split())
    count = 0
    for key in search_list:
        if key in data:
            count += data[key]
    return count

data = {'speaker':['Adam','Ben','Clair'],
        'speech': ['Thank you very much and good afternoon.',
                   'Let me clarify that because I want to make sure we have got everything right',
                   'By now you should have some good rest']}
df = pd.DataFrame(data)

wordlist = ['much', 'good','right']

# Store the per-row counts in a new column.
df['total'] = df['speech'].apply(lambda x: occurrence_counter(x, wordlist))
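
One caveat with splitting on whitespace is that punctuation and capitalisation stay attached to the tokens, so "right." or "Good" would not be found in the dictionary. A minimal sketch of one way to handle that, assuming the word list is lowercase and that treating any run of word characters as a token is acceptable (this splits contractions such as "don't"); count_listed_words is a hypothetical helper, not part of the answer above:

```python
import re
from collections import Counter

def count_listed_words(text, search_list):
    # Hypothetical helper: lowercase the text and extract runs of word
    # characters, so "Good," and "right." become "good" and "right".
    tokens = re.findall(r"\w+", text.lower())
    counts = Counter(tokens)
    # A Counter returns 0 for missing keys, so no membership test is needed.
    return sum(counts[word] for word in search_list)

wordlist = ["much", "good", "right"]
sample = "Good, good. That is right."

print(sample.split())                        # tokens keep their punctuation
print(count_listed_words(sample, wordlist))  # 3
```

It can be applied to the DataFrame in the same way, for example with df['speech'].apply(lambda x: count_listed_words(x, wordlist)).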
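
Another option is to split each speech into words, explode the result to one word per row, and count the matching words per speaker:

import pandas as pd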

data = {'speaker': ['Adam', 'Ben', 'Clair'],
        'speech': ['Thank you very much and good afternoon.',
                   'Let me clarify that because I want to make sure we have got everything right',
                   'By now you should have some good rest']}
df = pd.DataFrame(data)

wordlist = ['much', 'good', 'right']

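# Split each speech into a list of words, then give every word its own row.
df["speech"] = df["speech"].str.split()
df = df.explode("speech")
# Keep only the listed words and count them per speaker; speakers without
# any match do not appear in the result.
counts = df[df.speech.isin(wordlist)].groupby("speaker").size()
print(counts)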
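
The explode approach returns a separate Series and overwrites the speech column. If the goal is the new column from the question, a variation along the following lines might work: explode a copy of the tokens, then sum the boolean matches per original row index. The fourth row (speaker 'Dana') is made up here purely to show that rows without a match get a count of 0.

```python
import pandas as pd

df = pd.DataFrame({
    "speaker": ["Adam", "Ben", "Clair", "Dana"],  # 'Dana' is a made-up row
    "speech": [
        "Thank you very much and good afternoon.",
        "Let me clarify that because I want to make sure we have got everything right",
        "By now you should have some good rest",
        "No further comments",
    ],
})
wordlist = ["much", "good", "right"]

# explode() repeats the original row index for every token, leaving df untouched.
words = df["speech"].str.split().explode()

# Summing the boolean mask per index level gives one count per original row,
# including rows where nothing matched.
df["total"] = words.isin(wordlist).groupby(level=0).sum()
print(df[["speaker", "total"]])
```

Grouping by the row index rather than by speaker also keeps rows separate when the same speaker appears more than once.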


 