Python - Comparing n-grams across multiple text files

Question

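First-time poster here. I'm a new Python user with limited programming skills. Ultimately, I'm trying to identify and compare n-grams across a large number of text documents in the same directory. My analysis is somewhat similar to plagiarism detection: I want to compute the percentage of text documents in which a particular n-gram can be found. For now, I'm attempting a simpler version of the larger problem, comparing the n-grams in two text documents. I have no trouble identifying the n-grams, but I'm struggling to compare the two documents. Is there a way to store the n-grams in a list so that I can efficiently compare which n-grams are present in both documents? Here is what I have done so far (forgive the naive coding). For reference, I provide basic sentences below rather than the actual text documents I read in my code.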

import nltk
from nltk.util import ngrams

text1 = 'Hello my name is Jason'
text2 = 'My name is not Mike'

n = 3
trigrams1 = ngrams(text1.split(), n)
trigrams2 = ngrams(text2.split(), n)

print(trigrams1)
for grams in trigrams1:
    print(grams)

def compare(trigrams1, trigrams2):
    for each_gram in trigrams1:
        if each_gram in trigrams2:
            print(each_gram)
    return False

Thanks, everyone, for your help!

Answer 1

Use a common list inside the compare function. Append every n-gram that appears in both sets of trigrams to that list, and return the list at the end:

>>> # lower() ignores sentence case; list() materializes the result, since ngrams may return a generator.
>>> trigrams1 = list(ngrams(text1.lower().split(), n))
>>> trigrams2 = list(ngrams(text2.lower().split(), n))
>>> trigrams1
[('hello', 'my', 'name'), ('my', 'name', 'is'), ('name', 'is', 'jason')]
>>> trigrams2
[('my', 'name', 'is'), ('name', 'is', 'not'), ('is', 'not', 'mike')]
>>> def compare(trigrams1, trigrams2):
...    common=[]
...    for grams1 in trigrams1:
...       if grams1 in trigrams2:
...         common.append(grams1)
...    return common
... 
>>> compare(trigrams1, trigrams2)
[('my', 'name', 'is')]
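
Since the question asks about comparing efficiently, one possible variation is to use sets rather than a nested list scan. The following is only a sketch: word_ngrams and common_ngrams are hypothetical helper names, and word_ngrams is a plain-Python stand-in for nltk.util.ngrams so that the snippet runs without NLTK installed.

```python
def word_ngrams(text, n):
    """Return a list of lowercase word n-grams (tuples) from text."""
    words = text.lower().split()
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


def common_ngrams(text_a, text_b, n=3):
    """Return the set of n-grams that occur in both texts."""
    return set(word_ngrams(text_a, n)) & set(word_ngrams(text_b, n))


shared = common_ngrams('Hello my name is Jason', 'My name is not Mike')
print(shared)  # {('my', 'name', 'is')}
```

A set drops duplicates and ordering, so this fits when you only care whether an n-gram is present, not how often or where it occurs.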

Answer 2

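I think it may be easier to join the elements of each n-gram, build a list of strings, and then compare those.

Let's walk through the process using the example you provided.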

text1 = 'Hello my name is Jason'
text2 = 'My name is not Mike'

After applying the ngrams function from NLTK you get the following two lists, which I have given the same names as before, text1 and text2:

text1 = [('Hello', 'my', 'name'), ('my', 'name', 'is'), ('name', 'is', 'Jason')]
text2 = [('My', 'name', 'is'), ('name', 'is', 'not'), ('is', 'not', 'Mike')]

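When you compare n-grams, you should lowercase all of the elements so that 'my' and 'My' are not counted as separate tokens, which is clearly not what we want.

The following function does exactly that.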

def append_elements(n_gram):
    # Replace each n-gram tuple, in place, with one lowercase space-separated string.
    # n_gram must be a list (not a generator), because it is indexed and modified.
    for index in range(len(n_gram)):
        n_gram[index] = ' '.join(n_gram[index]).lower()
    return n_gram

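So if we pass in text1 we get ['hello my name', 'my name is', 'name is jason'], which is easier to work with.

Next comes the compare function. You were right to assume that a list can be used to store the commonalities. I named it common: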

def compare(n_gram1, n_gram2):
    n_gram1 = append_elements(n_gram1)
    n_gram2 = append_elements(n_gram2)
    common = []
    for phrase in n_gram1:
        if phrase in n_gram2:
            common.append(phrase)
    if not common:
        return False
        # or you could print a message saying no commonality was found
    else:
        for i in common:
            print(i)

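if not common means that the common list is empty; in that case the function returns False (or it could print a message instead).

Now, for your example, when we run compare(text1, text2), the only commonality is: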

>>> 
my name is
>>>

This is the correct answer.
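
As a possible variation on the approach above, here is a sketch that builds new lists instead of modifying its inputs and always returns a list (empty when nothing matches), which may be easier to reuse than printing. The names join_ngrams and compare_joined are hypothetical and not part of the code above.

```python
def join_ngrams(n_grams):
    """Return a new list of lowercase, space-joined n-gram strings."""
    return [' '.join(gram).lower() for gram in n_grams]


def compare_joined(n_gram1, n_gram2):
    """Return the phrases found in both inputs, in the order they appear in n_gram1."""
    phrases2 = set(join_ngrams(n_gram2))  # a set gives fast membership tests
    return [phrase for phrase in join_ngrams(n_gram1) if phrase in phrases2]


text1 = [('Hello', 'my', 'name'), ('my', 'name', 'is'), ('name', 'is', 'Jason')]
text2 = [('My', 'name', 'is'), ('name', 'is', 'not'), ('is', 'not', 'Mike')]

common = compare_joined(text1, text2)
print(common)  # ['my name is']
```

Because the inputs are left untouched, text1 and text2 still hold the original tuples afterwards and can be compared again.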

Answer 3

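I came across this old thread while working on a task very similar to yours. The code seems to work fine except for one bug, so I'm adding this answer in case anyone else stumbles on the question. The ngrams function from nltk.util returns a generator object, not a list. It needs to be converted to a list before it can be used with the compare function you wrote. Use lower() for case-insensitive matching.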
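
To illustrate why the conversion matters, here is a minimal sketch of how a membership test behaves on a generator. lazy_ngrams is a hypothetical plain-Python generator used as a stand-in, so the snippet does not depend on NLTK.

```python
def lazy_ngrams(words, n):
    """Yield word n-grams one at a time (hypothetical stand-in for a generator-returning ngrams)."""
    for i in range(len(words) - n + 1):
        yield tuple(words[i:i + n])


words = 'my name is not mike'.split()
target = ('name', 'is', 'not')

lazy = lazy_ngrams(words, 3)
first_check = target in lazy   # consumes items until the tuple is found
second_check = target in lazy  # the generator has already moved past it
print(first_check, second_check)  # True False

as_list = list(lazy_ngrams(words, 3))
print(target in as_list, target in as_list)  # True True
```

Each `in` test on a generator uses up the items it scans, so repeated lookups inside a loop silently miss matches; a list can be searched any number of times.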

Full example:

import nltk
from nltk.util import ngrams

text1 = 'Hello my name is Jason'
text2 = 'My name is not Mike'

n = 3
trigrams1 = ngrams(text1.lower().split(), n)
trigrams2 = ngrams(text2.lower().split(), n)

def compare_ngrams(trigrams1, trigrams2):
    trigrams1 = list(trigrams1)
    trigrams2 = list(trigrams2)
    common=[]
    for gram in trigrams1:
        if gram in trigrams2:
            common.append(gram)
    return common

common = compare_ngrams(trigrams1, trigrams2)
print(common)

Output:

[('my', 'name', 'is')]
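
For the larger goal stated in the question, the percentage of documents that contain a given n-gram, the same idea could be extended roughly as follows. This is a sketch under assumptions, not a tested solution for real data: word_ngrams, document_share, and read_directory are hypothetical helpers, the files are assumed to be UTF-8 .txt files, and tokenization is a plain split() rather than anything from NLTK.

```python
from collections import Counter
from pathlib import Path


def word_ngrams(text, n):
    """Return the set of distinct lowercase word n-grams in text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def document_share(texts, n=3):
    """Map each n-gram to the percentage of documents that contain it."""
    counts = Counter()
    for text in texts:
        counts.update(word_ngrams(text, n))  # a set, so each document counts once
    return {gram: 100.0 * count / len(texts) for gram, count in counts.items()}


def read_directory(directory, pattern='*.txt'):
    """Read every matching file in directory (assumed to be UTF-8 text)."""
    paths = sorted(Path(directory).glob(pattern))
    return [path.read_text(encoding='utf-8') for path in paths]


texts = ['Hello my name is Jason', 'My name is not Mike']
share = document_share(texts)
print(share[('my', 'name', 'is')])     # 100.0
print(share[('name', 'is', 'jason')])  # 50.0
```

With real files, something like document_share(read_directory('some_folder')) would apply the same counting to every .txt file in that (hypothetical) folder.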

Answer 1 0 2014-12-11 00:17:43
Answer 2 0 2014-12-11 00:57:29
Answer 3 0 2019-08-21 18:04:43