How to count the most frequently repeated phrases in Pandas

I have a Pandas DataFrame with one text column, and I want to count which phrases are the most common in this column. For example, in the text below, phrases like "a very good movie" and "last night" appear several times. I think there is a way to define n-grams, for example by requiring a phrase to be between 3 and 5 words long, but I do not know how to do that.

import pandas as pd


text = ['this is a very good movie that we watched last night',
        'i have watched a very good movie last night',
        'i love this song, its amazing',
        'what should we do if he asks for it',
        'movie last night was amazing',
        'a very nice song was played',
        'i would like to se a good show',
        'a good show was on tv last night']

df = pd.DataFrame({"text":text})
print(df)

So my goal is to rank the phrases (3-5 words long) that appear most often.

First split the text in a list comprehension and flatten it into vals, then create the ngrams, pass them to a Series, and finally use Series.value_counts:

from nltk import ngrams
vals = [y for x in df['text'] for y in x.split()]

n = [3,4,5]
a = pd.Series([y for x in n for y in ngrams(vals, x)]).value_counts()
print (a)
(a, good, show)                      2
(movie, last, night)                 2
(a, very, good)                      2
(last, night, i)                     2
(a, very, good, movie)               2
                                    ..
(should, we, do)                     1
(a, very, nice, song, was)           1
(asks, for, it, movie, last)         1
(this, song,, its, amazing, what)    1
(i, have, watched, a)                1
Length: 171, dtype: int64

Or, if the tuples should be joined by spaces:

n = [3,4,5]
a = pd.Series([' '.join(y) for x in n for y in ngrams(vals, x)]).value_counts()
print (a)
last night i                  2
a good show                   2
a very good movie             2
very good movie               2
movie last night              2
                             ..
its amazing what should       1
watched last night i have     1
to se a                       1
very good movie last night    1
a very nice song was          1
Length: 171, dtype: int64
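
Note that because vals flattens every row into one long word list, some n-grams in the output above, such as "last night i", span the end of one row and the start of the next. If phrases should stay inside a single row, the n-grams can be built per row instead. The following is only a minimal sketch using the standard library; the helper name row_ngrams is made up for illustration and is not part of pandas or NLTK:

```python
from collections import Counter

text = ['this is a very good movie that we watched last night',
        'i have watched a very good movie last night',
        'i love this song, its amazing',
        'what should we do if he asks for it',
        'movie last night was amazing',
        'a very nice song was played',
        'i would like to se a good show',
        'a good show was on tv last night']

def row_ngrams(sentence, sizes=(3, 4, 5)):
    """Yield space-joined n-grams of the given sizes from one sentence only."""
    words = sentence.split()
    for n in sizes:
        # Slide a window of n words over the sentence; rows shorter
        # than n words simply produce no n-grams of that size.
        for i in range(len(words) - n + 1):
            yield ' '.join(words[i:i + n])

# Count n-grams row by row, so no phrase crosses a row boundary.
counts = Counter(g for sentence in text for g in row_ngrams(sentence))

# Show the phrases that occur more than once.
for phrase, count in counts.most_common():
    if count > 1:
        print(phrase, count)
```

With a DataFrame, the same generator can be fed from df['text'] instead of the text list, and the resulting strings can be passed to pd.Series(...).value_counts() as in the snippets above.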

Another idea, using Counter:

from nltk import ngrams
from collections import Counter

vals = [y for x in df['text'] for y in x.split()]
c = Counter([' '.join(y) for x in [3,4,5] for y in ngrams(vals, x)])

df1 = pd.DataFrame({'ngrams': list(c.keys()),
                   'count': list(c.values())})
print (df1)
                   ngrams  count
0               this is a      1
1               is a very      1
2             a very good      2
3         very good movie      2
4         good movie that      1
..                    ...    ...
166  show a good show was      1
167    a good show was on      1
168   good show was on tv      1
169   show was on tv last      1
170  was on tv last night      1

[171 rows x 2 columns]
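
The DataFrame built from c.keys() and c.values() keeps the order in which the n-grams were first seen, so it is not yet ranked. One possible way to get a ranking is Counter.most_common, which returns the pairs sorted by count in descending order. The sketch below makes two assumptions: the window helper is a hypothetical stand-in for nltk.ngrams so that the snippet runs without NLTK, and "repeated" is taken to mean a count greater than 1:

```python
from collections import Counter
import pandas as pd

text = ['this is a very good movie that we watched last night',
        'i have watched a very good movie last night',
        'i love this song, its amazing',
        'what should we do if he asks for it',
        'movie last night was amazing',
        'a very nice song was played',
        'i would like to se a good show',
        'a good show was on tv last night']
df = pd.DataFrame({"text": text})

# Same flattening as in the answer: one word list across all rows.
vals = [y for x in df['text'] for y in x.split()]

def window(words, n):
    """Hypothetical replacement for nltk.ngrams: sliding window of n words."""
    return zip(*(words[i:] for i in range(n)))

c = Counter(' '.join(g) for n in (3, 4, 5) for g in window(vals, n))

# most_common() yields (phrase, count) pairs, highest count first.
df1 = pd.DataFrame(c.most_common(), columns=['ngrams', 'count'])

# Keep only the phrases that occur more than once.
repeated = df1[df1['count'] > 1].reset_index(drop=True)
print(repeated)
```

Sorting an existing frame with df1.sort_values('count', ascending=False) should give an equivalent ranking if the DataFrame has already been built the other way.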

