
Finding common phrases using Python

I am trying to take a CSV file and find the common phrases and the count using Python 2.7. Currently I can only get individual words and their counts, but I need common phrases.

Here's my code so far:

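import csv
from sys import argv
from collections import Counter, defaultdict

script, filename = argv
data = defaultdict(list)  # maps the third column's value to a list of fourth-column texts

# Python 2.7: the csv module expects the file opened in binary mode
with open(filename, 'rb') as f:
    reader = csv.reader(f)
    next(reader, None)  # skip the header row
    for row in reader:
        data[row[2]].append(row[3])

# Write the collected data once, after the loop has finished
with open("output.txt", "w") as text_file:
    text_file.write("%r" % data)

print(data)

# Count individual words across all collected text
c = Counter(word for texts in data.values() for text in texts for word in text.split())
print(c.most_common(10))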

If you are going to do this for more than one file, or for large files, I suggest using an indexing engine like Lucene.

You can index n-grams (phrases of n words) into Lucene and then use Lucene's query and search API to rank them and find the phrases with the highest occurrence.

Lucene is available from Python through PyLucene.
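
To make the n-gram idea concrete without setting up Lucene, here is a minimal sketch that builds and counts n-grams in memory with collections.Counter. It is a standard-library stand-in suitable for small inputs, not Lucene's API; the helper names ngrams and count_ngrams, the whitespace tokenization, and the sample rows are all assumptions made for illustration.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return list(zip(*(tokens[i:] for i in range(n))))


def count_ngrams(texts, n=2):
    """Count n-grams over an iterable of strings (for example, one CSV column)."""
    counts = Counter()
    for text in texts:
        tokens = text.lower().split()  # naive whitespace tokenization (assumption)
        counts.update(ngrams(tokens, n))
    return counts


# Hypothetical sample rows standing in for the CSV text column
rows = [
    "the quick brown fox",
    "the quick red fox",
    "a quick brown dog",
]
bigram_counts = count_ngrams(rows, n=2)
print(bigram_counts.most_common(2))
```

For a single modest file this is probably enough; an index such as Lucene starts to pay off when the counts no longer fit comfortably in memory or need to be queried repeatedly.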

First, consider extracting phrases with a natural language tokenizer. Even the simplest language has an enormous number of subtleties in what counts as a sentence, i.e., trying to parse phrases with a regex is probably going to drive you crazy.

From there, apply the counting approach you are already using to "phrases" instead of words, taking "common phrases" to mean those that appear more than once. If that is not what you mean by "common phrases", then you should clarify that in your question.
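
The steps above can be sketched as follows: split the text into sentences, form word n-grams inside each sentence so that no phrase crosses a sentence boundary, and keep only those seen more than once. The naive_sentences splitter is just a placeholder so the example runs with the standard library; as noted above, a regex is a poor sentence detector, so a real tokenizer should be passed in instead. All function names, the phrase lengths, and the sample text are hypothetical.

```python
import re
from collections import Counter


def naive_sentences(text):
    """Placeholder splitter: breaks after ., ! or ? followed by whitespace.

    Assumption for illustration only; swap in a natural language tokenizer.
    """
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def common_phrases(text, min_words=2, max_words=3, split_sentences=naive_sentences):
    """Return a Counter of phrases (as strings) that occur more than once."""
    counts = Counter()
    for sentence in split_sentences(text):
        words = re.findall(r"[a-z0-9']+", sentence.lower())
        for n in range(min_words, max_words + 1):
            for i in range(len(words) - n + 1):
                counts[" ".join(words[i:i + n])] += 1
    # "Common" here means the phrase appears at least twice
    return Counter({phrase: c for phrase, c in counts.items() if c > 1})


sample = "I like green tea. You like green tea too. Tea is good."
result = common_phrases(sample)
print(result.most_common())
```

If NLTK and its tokenizer data are installed, passing split_sentences=nltk.sent_tokenize should give more reliable sentence boundaries than the placeholder.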
