简体   繁体   中英

Most common sentences extractions with count using Python

I want to write a Python Script that searches all Excel rows and returns top 10 most common sentences. I have written the basics of ngrams for a txt file.

The file contains csv text with dj is best 4 times and gd is cool 3 times.

import nltk
import pandas as pd

file = open('dj.txt', encoding="utf8")
text= file.read()
length = [3]
ngrams_count = {}
for n in length:
    ngrams = tuple(nltk.ngrams(text.split(' '), n=n))
    ngrams_count.update({' '.join(i) : ngrams.count(i) for i in ngrams})
ngrams_count
df = pd.DataFrame(list(zip(ngrams_count, ngrams_count.values())), 
                  columns=['Ngramm', 'Count']).sort_values(['Count'], 
                                                           ascending=False)
df

Output -

   Ngramm  Count
1                      is best,dj is      4
3                      is cool,gd is      2
21                     is best,gd is      2
25                best,dj is Best,dj      1
19                    not cool,dj is      1
20                cool,dj is best,gd      1
22                best,gd is cool,dj      1
23                     is cool,dj is      1
24                cool,dj is best,dj      1
0                      dj is best,dj      1
18                    is not cool,dj      1
27                Best,dj is best,dj      1
28                best,dj is best,dj      1
29                best,dj is best,gd      1
30                best,gd is cool,gd      1
31                cool,gd is COOL,gd      1
32                     is COOL,gd is      1
26                     is Best,dj is      1
17                    good,dj is not      1
16                    not good,dj is      1
15                    is not good,dj      1
14                  better,dj is not      1
13                   is better,dj is      1
12         good,sandeep is better,dj      1
11                is good,sandeep is      1
10    excellent,prem is good,sandeep      1
9               is excellent,prem is      1
8   superb,sandeep is excellent,prem      1
7               is superb,sandeep is      1
6        best,prem is superb,sandeep      1
5                    is best,prem is      1
4               cool,gd is best,prem      1
2                 best,dj is cool,gd      1
33                   COOL,gd is cool      1

So firstly, It shows 2 for gd is cool , i cant figure out why ?.. and then I want to sort this output so that it shows something like this

Ngramm  Count
dj is cool   4
gd is cool   3
....and so on....

And then i want this to do it for excel file row by row.

I am really new at this can anyone point me in the right direction?

As you can see, text.split(' ') does not split on punctuation, like commas.
A quick and dirty fix for this particular data (where the only punctuation appearing seems to be commas, and none of them are trailed by whitespace) could be writing.

text.replace(',',' ').split(' ')
 >>> "ab,c".split(' ') ['a', 'b,c'] # <--- 2 elements >>> "ab,c".replace(',',' ').split(' ') ['a', 'b', 'c'] # <--- 3 elements 

On the longer run you may want to learn about regular expressions , which can be a painful experience, but for this case it is easy:

 >>> import re >>> re.split("[ ,]+","ab,c") ['a', 'b', 'c'] 

Since this is a csv file, please do yourself a favor and parse the csv first! Then take the contents and process them any way you want. But your data seems to contain one "sentence" per cell, so if our goal is to find the most common sentence, why are you throwing tokenization and ngrams at this task?

import csv
from collections import Counter
with open('dj.txt', encoding="utf8") as handle:
    sentcounts = Counter(cell for row in csv.reader(handle) for cell in row)

print("Frequency  Sentence")
for sent, freq in sentcounts.most_common(5):
    print("%9d"%freq, sent)

If you did want the tokens you could just use split() in this simple case, but for more realistic text use nltk.word_tokenize() , which knows all about punctuation.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM