
Finding common phrases using Python

I am trying to take a CSV file and find its common phrases and their counts using Python 2.7. Currently I can only get individual words and their counts, but I need common phrases.

Here's my code so far:

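import csv
from sys import argv
from collections import defaultdict
from collections import Counter

script, filename = argv
data = defaultdict(list)

# Python 2.7: the csv module expects the file opened in binary mode
with open(filename, 'rb') as f:
    reader = csv.reader(f)
    next(reader, None)  # skip the header row
    for row in reader:
        # group the text in column 3 under the key in column 2
        data[row[2]].append(row[3])

# write the grouped data once, after all rows have been read
with open("output.txt", "w") as text_file:
    text_file.write("%r" % data)

print(data)
# count individual words across all collected text
c = Counter(word for texts in data.values() for text in texts for word in text.split())
print(c.most_common(10))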

If you are going to be doing this for more than one file or for large files, I suggest using an indexing engine like Lucene.

You can index n-grams (phrases of n words) into Lucene and then use Lucene's query and search API to rank and find the phrases with the highest occurrence.

Lucene is supported in Python through PyLucene.
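For a single small file, the n-gram idea can also be tried without an indexing engine. The following is only a minimal sketch using the standard library, not Lucene's API: the helper names `ngrams` and `count_phrases`, the sample rows, and the simple regex word tokenizer are all assumptions made for illustration.

```python
from collections import Counter
import re

def ngrams(words, n):
    """Return every run of n consecutive words as a single phrase string."""
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

def count_phrases(texts, n=2):
    """Count n-word phrases across an iterable of text strings."""
    counts = Counter()
    for text in texts:
        # naive word tokenization (an assumption): lowercase, then keep
        # runs of letters, digits and apostrophes
        words = re.findall(r"[a-z0-9']+", text.lower())
        counts.update(ngrams(words, n))
    return counts

# hypothetical sample rows standing in for one CSV column
sample = [
    "the quick brown fox",
    "the quick red fox",
    "a quick brown dog",
]
counts = count_phrases(sample, n=2)
print(counts.most_common(2))
```

In the question's script, the lists stored in `data` could be passed to `count_phrases` in place of `sample`; n-grams are built per row, so a phrase never spans two rows.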

First, extract the phrases using a natural-language tokenizer. Even the simplest language has an enormous number of subtleties in the definition of a sentence, i.e., trying to parse phrases with a regex is probably going to drive you crazy.

From there, use your existing approach to count the frequency of "phrases" instead of words, taking "common phrases" to mean those that appear more than once. If that is not what you mean by "common phrases", then you should clarify further in your question.
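A rough sketch of that counting step, assuming "common" means "appears more than once". The names `naive_sentences` and `common_phrases` and the sample rows are hypothetical. The regex splitter is only a placeholder so the example runs with the standard library; a real natural-language tokenizer, such as NLTK's `sent_tokenize`, could be passed in through the `tokenize` parameter instead.

```python
from collections import Counter
import re

def naive_sentences(text):
    """Placeholder splitter: break after ., ! or ? followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

def common_phrases(texts, tokenize=naive_sentences):
    """Return (phrase, count) pairs for phrases seen more than once."""
    counts = Counter()
    for text in texts:
        # normalize case and trailing punctuation so repeated phrases match
        counts.update(p.lower().rstrip(".!?") for p in tokenize(text))
    return [(p, c) for p, c in counts.most_common() if c > 1]

# hypothetical sample rows standing in for one CSV column
rows = [
    "Thanks for the help. It works now!",
    "It works now. Thanks for the help.",
    "Still broken.",
]
print(common_phrases(rows))
```

Keeping the tokenizer as a parameter means the counting logic stays the same when the placeholder is replaced by a proper tokenizer.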
