Extracting the most common sentences with counts using Python

Question

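I want to write a Python script that searches all rows of an Excel file and returns the top 10 most common sentences. So far I have written the basics with ngrams for a txt file.

The file contains csv text in which "dj is best" appears 4 times and "gd is best" appears 3 times.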

import nltk
import pandas as pd

file = open('dj.txt', encoding="utf8")
text= file.read()
length = [3]
ngrams_count = {}
for n in length:
    ngrams = tuple(nltk.ngrams(text.split(' '), n=n))
    ngrams_count.update({' '.join(i) : ngrams.count(i) for i in ngrams})
ngrams_count
df = pd.DataFrame(list(zip(ngrams_count, ngrams_count.values())), 
                  columns=['Ngramm', 'Count']).sort_values(['Count'], 
                                                           ascending=False)
df

Output:

   Ngramm  Count
1                      is best,dj is      4
3                      is cool,gd is      2
21                     is best,gd is      2
25                best,dj is Best,dj      1
19                    not cool,dj is      1
20                cool,dj is best,gd      1
22                best,gd is cool,dj      1
23                     is cool,dj is      1
24                cool,dj is best,dj      1
0                      dj is best,dj      1
18                    is not cool,dj      1
27                Best,dj is best,dj      1
28                best,dj is best,dj      1
29                best,dj is best,gd      1
30                best,gd is cool,gd      1
31                cool,gd is COOL,gd      1
32                     is COOL,gd is      1
26                     is Best,dj is      1
17                    good,dj is not      1
16                    not good,dj is      1
15                    is not good,dj      1
14                  better,dj is not      1
13                   is better,dj is      1
12         good,sandeep is better,dj      1
11                is good,sandeep is      1
10    excellent,prem is good,sandeep      1
9               is excellent,prem is      1
8   superb,sandeep is excellent,prem      1
7               is superb,sandeep is      1
6        best,prem is superb,sandeep      1
5                    is best,prem is      1
4               cool,gd is best,prem      1
2                 best,dj is cool,gd      1
33                   COOL,gd is cool      1

So, first, it shows a count of 2 for "gd is cool", and I can't figure out why. Second, I want to sort this output so that it shows something like this:

Ngramm  Count
dj is cool   4
gd is cool   3
....and so on....

Then I want this to run row by row against an Excel file.

I am really new to this. Can anyone point me in the right direction?

Answer 1

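As you can see, text.split(' ') does not split on punctuation such as commas.
A quick and dirty fix for this particular data (where the only punctuation that occurs seems to be commas, and none of them is followed by a space) might be: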

text.replace(',',' ').split(' ')

>>> "a b,c".split(' ')
['a', 'b,c'] # <--- 2 elements
>>> "a b,c".replace(',',' ').split(' ')
['a', 'b', 'c'] # <--- 3 elements

In the long run you may want to learn regular expressions, which can be a painful experience, but in this case it is easy:

>>> import re
>>> re.split("[ ,]+","a b,c")
['a', 'b', 'c']
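
To tie the regex split back to the counting in the question, here is a minimal sketch. It assumes the whole text fits in memory, swaps the question's tuple.count loop for collections.Counter, and also splits on newlines; the function name ngram_counts and the sample string are made up for illustration.

```python
import re
from collections import Counter

def ngram_counts(text, n=3):
    # Split on runs of spaces, commas, or newlines and drop empty strings
    tokens = [t for t in re.split(r"[ ,\n]+", text) if t]
    # Slide a window of n tokens across the token list
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(" ".join(g) for g in grams)

# Made-up sample standing in for the contents of dj.txt
sample = "dj is best,dj is best,gd is cool"
counts = ngram_counts(sample)
for gram, freq in counts.most_common(3):
    print(freq, gram)
```

Note that the windows still run across cell boundaries, so trigrams such as "is best dj" are counted alongside the real sentences. That is expected for ngrams over a flat token stream and is one reason the counts may not match the number of sentences in the file.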

Answer 2

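Since this is a csv file, do yourself a favor and parse the csv first! Then take the contents and process them however you want. But your data seems to contain one "sentence" per cell, so if the goal is to find the most common sentences, why throw tokenization and ngrams at this task?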

import csv
from collections import Counter
with open('dj.txt', encoding="utf8") as handle:
    sentcounts = Counter(cell for row in csv.reader(handle) for cell in row)

print("Frequency  Sentence")
for sent, freq in sentcounts.most_common(5):
    print("%9d"%freq, sent)
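
The counting above is case-sensitive, and the output in the question contains variants such as "Best" and "COOL". If those should count as the same sentence (an assumption, since the question does not say so), a sketch along these lines may help. The helper name top_sentences and the in-memory sample are hypothetical.

```python
import csv
import io
from collections import Counter

def top_sentences(handle, n=10):
    # Normalize each cell: strip surrounding whitespace and lowercase it
    cells = (cell.strip().lower() for row in csv.reader(handle) for cell in row)
    # Skip empty cells, then return the n most frequent sentences
    return Counter(c for c in cells if c).most_common(n)

# Made-up sample standing in for the real file
sample = io.StringIO("dj is best,dj is Best,gd is cool\ngd is COOL,dj is best\n")
for sent, freq in top_sentences(sample):
    print("%9d" % freq, sent)
```

The same function accepts an open file handle, for example the one returned by open('dj.txt', encoding="utf8"). For an Excel sheet, the cell values would first have to be read with a spreadsheet library and then fed to the same Counter.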

If you do want tokens, you can use split() in this simple case, but for more realistic text use nltk.word_tokenize(), which knows all about punctuation.
