
Most common sentences extraction with count using Python

I want to write a Python script that searches all Excel rows and returns the top 10 most common sentences. I have written the basics of ngrams for a txt file.

The file contains csv text, where dj is best appears 4 times and gd is cool appears 3 times.

import nltk
import pandas as pd

with open('dj.txt', encoding="utf8") as file:
    text = file.read()

ngrams_count = {}
for n in [3]:
    ngrams = tuple(nltk.ngrams(text.split(' '), n=n))
    ngrams_count.update({' '.join(i): ngrams.count(i) for i in ngrams})

df = pd.DataFrame(list(zip(ngrams_count, ngrams_count.values())),
                  columns=['Ngramm', 'Count']).sort_values(['Count'],
                                                           ascending=False)
print(df)

Output:

   Ngramm  Count
1                      is best,dj is      4
3                      is cool,gd is      2
21                     is best,gd is      2
25                best,dj is Best,dj      1
19                    not cool,dj is      1
20                cool,dj is best,gd      1
22                best,gd is cool,dj      1
23                     is cool,dj is      1
24                cool,dj is best,dj      1
0                      dj is best,dj      1
18                    is not cool,dj      1
27                Best,dj is best,dj      1
28                best,dj is best,dj      1
29                best,dj is best,gd      1
30                best,gd is cool,gd      1
31                cool,gd is COOL,gd      1
32                     is COOL,gd is      1
26                     is Best,dj is      1
17                    good,dj is not      1
16                    not good,dj is      1
15                    is not good,dj      1
14                  better,dj is not      1
13                   is better,dj is      1
12         good,sandeep is better,dj      1
11                is good,sandeep is      1
10    excellent,prem is good,sandeep      1
9               is excellent,prem is      1
8   superb,sandeep is excellent,prem      1
7               is superb,sandeep is      1
6        best,prem is superb,sandeep      1
5                    is best,prem is      1
4               cool,gd is best,prem      1
2                 best,dj is cool,gd      1
33                   COOL,gd is cool      1

So firstly, it shows 2 for gd is cool, and I can't figure out why. And then I want to sort this output so that it shows something like this:

Ngramm  Count
dj is best   4
gd is cool   3
....and so on....

And then I want this to be done for an Excel file, row by row.

I am really new at this. Can anyone point me in the right direction?

As you can see, text.split(' ') does not split on punctuation, like commas.
A quick and dirty fix for this particular data (where the only punctuation appearing seems to be commas, and none of them are followed by whitespace) could be:

text.replace(',', ' ').split(' ')

>>> "a b,c".split(' ')
['a', 'b,c']                              # <--- 2 elements
>>> "a b,c".replace(',', ' ').split(' ')
['a', 'b', 'c']                           # <--- 3 elements

In the longer run you may want to learn about regular expressions, which can be a painful experience, but for this case it is easy:

>>> import re
>>> re.split("[ ,]+", "a b,c")
['a', 'b', 'c']

Since this is a csv file, please do yourself a favor and parse the csv first! Then take the contents and process them any way you want. But your data seems to contain one "sentence" per cell, so if the goal is to find the most common sentence, why are you throwing tokenization and ngrams at this task?

import csv
from collections import Counter
with open('dj.txt', encoding="utf8") as handle:
    sentcounts = Counter(cell for row in csv.reader(handle) for cell in row)

print("Frequency  Sentence")
for sent, freq in sentcounts.most_common(5):
    print("%9d"%freq, sent)

If you did want the tokens, you could just use split() in this simple case, but for more realistic text use nltk.word_tokenize(), which knows all about punctuation.
