Count occurrences of each of certain words in a pandas DataFrame

Question

I want to count the number of occurrences of each of certain words in a data frame. I currently do it using str.contains:

a = df2[df2['col1'].str.contains("sample")].groupby('col2').size()
n = a.apply(lambda x: 1).sum()

Is there a method to match a regular expression and get the count of occurrences? In my case I have a large dataframe and I want to match around 100 strings.

Answer 1

Update: The original answer (kept further down) counts the rows that contain a substring.

To count all the occurrences of a substring you can use .str.count:

In [21]: df = pd.DataFrame(['hello', 'world', 'hehe'], columns=['words'])

In [22]: df.words.str.count("he|wo")
Out[22]:
0    1
1    1
2    2
Name: words, dtype: int64

In [23]: df.words.str.count("he|wo").sum()
Out[23]: 4
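For the roughly 100 strings mentioned in the question, the same .str.count idea can be applied once per word. The following is only a sketch under assumptions: the column name col1 comes from the question, while the sample rows and the words list are made up for illustration. re.escape is used so that each word is matched literally rather than as a regular expression.

```python
import re

import pandas as pd

# Hypothetical data; only the column name 'col1' comes from the question.
df = pd.DataFrame({"col1": ["a sample of samples", "no match here", "another sample"]})
words = ["sample", "match", "missing"]  # hypothetical stand-in for the ~100 strings

# One pass per word: total number of occurrences across the whole column.
counts = {w: int(df["col1"].str.count(re.escape(w)).sum()) for w in words}
print(counts)

# Single-pass alternative: find every match, flatten, then tally.
# Assumes no word is a prefix of another, since alternation takes the first hit.
pattern = "|".join(re.escape(w) for w in words)
tally = (
    df["col1"].str.findall(pattern)
    .explode()                      # rows with no match become NaN
    .value_counts()                 # NaN is dropped by default
    .reindex(words, fill_value=0)   # keep words that never occurred
)
print(tally)
```

Both versions count substrings, so "samples" counts toward "sample". If only whole words should count, wrapping each escaped word in \b anchors is one option.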

The str.contains method accepts a regular expression:

Definition: df.words.str.contains(self, pat, case=True, flags=0, na=nan)
Docstring:
Check whether given pattern is contained in each string in the array

Parameters
----------
pat : string
    Character sequence or regular expression
case : boolean, default True
    If True, case sensitive
flags : int, default 0 (no flags)
    re module flags, e.g. re.IGNORECASE
na : default NaN, fill value for missing values.

For example:

In [11]: df = pd.DataFrame(['hello', 'world'], columns=['words'])

In [12]: df
Out[12]:
   words
0  hello
1  world

In [13]: df.words.str.contains(r'[hw]')
Out[13]:
0    True
1    True
Name: words, dtype: bool

In [14]: df.words.str.contains(r'he|wo')
Out[14]:
0    True
1    True
Name: words, dtype: bool

To count the rows that contain a match, you can just sum this boolean Series:

In [15]: df.words.str.contains(r'he|wo').sum()
Out[15]: 2

In [16]: df.words.str.contains(r'he').sum()
Out[16]: 1
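The case, flags and na parameters from the docstring matter on real data, where capitalisation varies and some cells are missing. A minimal sketch, assuming a made-up words column that contains a missing value, showing the row count from str.contains next to the occurrence count from str.count:

```python
import re

import pandas as pd

# Hypothetical data with mixed case and one missing value.
df = pd.DataFrame({"words": ["Hello", "world", None, "hehe"]})

# na=False treats missing values as non-matches, so the mask is purely boolean.
mask = df["words"].str.contains("he", case=False, na=False)
rows_with_match = int(mask.sum())

# str.count takes re flags as well; sum() skips the NaN from the missing row.
total_matches = int(df["words"].str.count("he", flags=re.IGNORECASE).sum())

print(rows_with_match, total_matches)
```

Here the two numbers differ because "hehe" is one matching row but two occurrences.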

Answer 2

To count the total number of matches, use s.str.match(...).str.get(0).count().

If your regex will be matching several unique words, to be tallied individually, use s.str.match(...).str.get(0).groupby(lambda x: x).count()

It works like this:它是这样工作的：

In [12]: s
Out[12]: 
0    ax
1    ay
2    bx
3    by
4    bz
dtype: object

The match string method handles regular expressions...

In [13]: s.str.match('(b[x-y]+)')
Out[13]: 
0       []
1       []
2    (bx,)
3    (by,)
4       []
dtype: object

...but the results, as given, are not very convenient. The string method get takes the matches as strings and converts empty results to NaNs...

In [14]: s.str.match('(b[x-y]+)').str.get(0)
Out[14]: 
0    NaN
1    NaN
2     bx
3     by
4    NaN
dtype: object

...which are not counted.

In [15]: s.str.match('(b[x-y]+)').str.get(0).count()
Out[15]: 2
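The output above reflects the pandas version this answer was written against. In recent pandas releases str.match returns a boolean Series instead of the captured groups, so the .str.get(0) step no longer applies. A sketch of the same idea using str.extract, which returns the first capture group and NaN where nothing matched (the Series below is rebuilt from the example above):

```python
import pandas as pd

s = pd.Series(["ax", "ay", "bx", "by", "bz"])

# Recent pandas: str.match gives True/False per row (match anchored at the start).
matched = s.str.match(r"b[x-y]+")

# str.extract with expand=False returns a Series of the captured text, NaN if no match.
extracted = s.str.extract(r"(b[x-y]+)", expand=False)

total = int(extracted.count())        # NaNs are not counted
per_word = extracted.value_counts()   # tally each distinct matched word

print(total)
print(per_word)
```

Note that str.extract searches anywhere in the string, whereas str.match only matches at the beginning; for this example the two agree.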

Answer 3

You can use the value_counts function.

import pandas as pd

# URL to .csv file
data_url = 'https://vincentarelbundock.github.io/Rdatasets/csv/carData/Arrests.csv'
# Reading the data
df = pd.read_csv(data_url, index_col=0)

# pandas count distinct values in column
df['sex'].value_counts()

Source: link
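value_counts tallies whole cell values, so on its own it fits a categorical column like the one above rather than words inside free text. One way to apply it to the original question, sketched here with made-up rows and a hypothetical words list, is to split each cell into tokens first and then keep only the words of interest:

```python
import pandas as pd

# Hypothetical data; only the column name 'col1' comes from the question.
df = pd.DataFrame({"col1": ["a sample of text", "sample text again", "nothing relevant"]})
words = ["sample", "text", "missing"]  # hypothetical words of interest

# Split on whitespace, put one token per row, then count each distinct token.
tokens = df["col1"].str.split().explode()
counts = tokens.value_counts().reindex(words, fill_value=0)

print(counts)
```

This counts exact whitespace-separated tokens, so punctuation or case differences would need normalising beforehand.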

3 answers

Answer 1
55, accepted, 2013-07-10 15:08:46

Answer 2
6, 2013-07-10 15:08:21

Answer 3
0, 2021-04-22 16:06:21
