Find relative count of most common words from set of sentences in Python
Most common sentences extractions with count using Python
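I want to write a Python script that searches all rows of an Excel file and returns the 10 most common sentences. So far I have written a basic ngram version for a txt file.
The file contains csv text in which "dj is best" appears 4 times and "gd is cool" appears 3 times.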
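import nltk
import pandas as pd

# Read the whole file into one string.
with open('dj.txt', encoding="utf8") as file:
    text = file.read()

length = [3]  # n-gram sizes to count
ngrams_count = {}
for n in length:
    # Build every n-gram of size n from the space-separated tokens.
    ngrams = tuple(nltk.ngrams(text.split(' '), n=n))
    # Map each n-gram (joined into a string) to its number of occurrences.
    ngrams_count.update({' '.join(i): ngrams.count(i) for i in ngrams})
ngrams_count

# Tabulate the counts, most frequent first.
df = pd.DataFrame(list(zip(ngrams_count, ngrams_count.values())),
                  columns=['Ngramm', 'Count']).sort_values(['Count'],
                                                           ascending=False)
df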
Output:
    Ngramm                            Count
1   is best,dj is                     4
3   is cool,gd is                     2
21  is best,gd is                     2
25  best,dj is Best,dj                1
19  not cool,dj is                    1
20  cool,dj is best,gd                1
22  best,gd is cool,dj                1
23  is cool,dj is                     1
24  cool,dj is best,dj                1
0   dj is best,dj                     1
18  is not cool,dj                    1
27  Best,dj is best,dj                1
28  best,dj is best,dj                1
29  best,dj is best,gd                1
30  best,gd is cool,gd                1
31  cool,gd is COOL,gd                1
32  is COOL,gd is                     1
26  is Best,dj is                     1
17  good,dj is not                    1
16  not good,dj is                    1
15  is not good,dj                    1
14  better,dj is not                  1
13  is better,dj is                   1
12  good,sandeep is better,dj         1
11  is good,sandeep is                1
10  excellent,prem is good,sandeep    1
9   is excellent,prem is              1
8   superb,sandeep is excellent,prem  1
7   is superb,sandeep is              1
6   best,prem is superb,sandeep       1
5   is best,prem is                   1
4   cool,gd is best,prem              1
2   best,dj is cool,gd                1
33  COOL,gd is cool                   1
So, first of all, it shows 2 for "gd is cool" and I can't figure out why. Then I want to sort this output so that it shows something like this:
Ngramm Count
dj is cool 4
gd is cool 3
....and so on....
Then I want this to run row by row against an Excel file.
I'm really new to this; can someone point me in the right direction?
As you can see, text.split(' ') does not split on punctuation such as commas.
A quick-and-dirty fix for this particular data (where the only punctuation seems to be commas, and none of them is followed by a space) could be:
text.replace(',',' ').split(' ')
>>> "a b,c".split(' ')
['a', 'b,c']  # <--- 2 elements
>>> "a b,c".replace(',',' ').split(' ')
['a', 'b', 'c']  # <--- 3 elements
In the long run you may want to learn about regular expressions, which can be a painful experience, but in this case it is easy:
>>> import re
>>> re.split("[ ,]+", "a b,c")
['a', 'b', 'c']
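To see how the tokenization fix changes the n-gram counts, here is a minimal sketch that splits on spaces and commas with `re.split` and counts trigrams with `collections.Counter` rather than calling `tuple.count` once per n-gram. The helper name `count_ngrams` and the sample string are made up for illustration; they are not taken from the question's file.

```python
import re
from collections import Counter


def count_ngrams(text, n=3):
    """Count word n-grams, splitting on runs of spaces, commas and newlines."""
    # Drop empty strings that re.split yields at the start or end of the text.
    tokens = [tok for tok in re.split(r"[ ,\n]+", text) if tok]
    # Zipping n shifted copies of the token list yields consecutive n-tuples.
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(" ".join(gram) for gram in grams)


# Hypothetical sample, not the asker's data.
sample = "dj is best,gd is cool,dj is best,gd is cool,gd is cool"
counts = count_ngrams(sample)
for gram, freq in counts.most_common(3):
    print(freq, gram)
```

Note that trigrams spanning two sentences, such as "is best gd", are still counted when the whole text is tokenized as one stream, so the counts only partly reflect whole sentences.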
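Since this is a csv file, do yourself a favor and parse the csv first! Then take the contents and process them however you like. But your data seems to contain one "sentence" per cell, so if the goal is to find the most common sentences, why throw tokenization and ngrams at this task?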
import csv
from collections import Counter

with open('dj.txt', encoding="utf8") as handle:
    sentcounts = Counter(cell for row in csv.reader(handle) for cell in row)

print("Frequency Sentence")
for sent, freq in sentcounts.most_common(5):
    print("%9d"%freq, sent)
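The same idea extends to a top 10 and to rows that come from an Excel sheet. What follows is only a sketch under stated assumptions: the function name `most_common_sentences`, the case-folding step (so that "gd is COOL" and "gd is cool" count together), and the sample rows are all illustrative. Any iterable of rows works, so rows from `csv.reader` can be passed in directly.

```python
import csv
import io
from collections import Counter


def most_common_sentences(rows, n=10, ignore_case=True):
    """Return the n most frequent cell values across an iterable of rows."""
    counts = Counter()
    for row in rows:
        for cell in row:
            sentence = str(cell).strip()
            if not sentence:
                continue  # skip empty cells
            if ignore_case:
                # Assumption: case variants of a sentence should count together.
                sentence = sentence.lower()
            counts[sentence] += 1
    return counts.most_common(n)


# Hypothetical in-memory CSV standing in for the real file.
sample = io.StringIO(
    "dj is best,gd is cool,dj is Best\n"
    "gd is COOL,dj is best,prem is good\n"
)
top = most_common_sentences(csv.reader(sample))
for sentence, freq in top:
    print("%9d" % freq, sentence)

# For an Excel file the rows could come from pandas instead. This assumes
# pandas and an Excel engine such as openpyxl are installed, and
# "sentences.xlsx" is a placeholder file name:
#   import pandas as pd
#   frame = pd.read_excel("sentences.xlsx", header=None).fillna("")
#   top = most_common_sentences(frame.itertuples(index=False))
```

Passing `ignore_case=False` keeps the exact-match behaviour of the plain `Counter` version.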
If you do want tokens, you can use split() in this simple case, but for more realistic text use nltk.word_tokenize(), which knows all about punctuation.
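If tokens are needed, tokenizing each cell separately keeps n-grams from running across sentence boundaries. Below is a rough sketch that uses a simple regular expression as a stand-in tokenizer; `nltk.word_tokenize` could be substituted, provided NLTK's tokenizer data is installed. The function names and the sample sentences are hypothetical.

```python
import re
from collections import Counter


def simple_tokenize(sentence):
    """Stand-in tokenizer: lowercase runs of letters, digits or apostrophes."""
    return re.findall(r"[a-z0-9']+", sentence.lower())


def ngrams_per_sentence(sentences, n=2):
    """Count n-grams inside each sentence, never across two sentences."""
    counts = Counter()
    for sentence in sentences:
        tokens = simple_tokenize(sentence)
        # Slide a window of size n over this sentence's tokens only.
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts


# Hypothetical cells, e.g. as produced by csv.reader.
cells = ["dj is best", "gd is cool", "dj is not cool", "gd is COOL"]
bigrams = ngrams_per_sentence(cells)
print(bigrams.most_common(3))
```

Because each cell is processed on its own, a bigram such as "best gd" can never appear, unlike when the whole file is tokenized as one stream.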