How can I extract keywords (not the most frequent words) from a corpus using Python and NLTK?

Question

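I am trying to extract keywords from a text or a corpus. These are not the most frequent words, but the words that are most "about" the text. I have a comparison example, and the list I generate is very different from the example list. Could you give me a pointer for producing a good list of keywords that does not include low-meaning words like "you" and "tis"?

I am using "Romeo and Juliet" as my text. My approach (see Scott & Tribble below) is to compare R&J to the complete Shakespeare plays and pull out the words that show up much more frequently in R&J than in the complete plays. That should weed out common words like "the", but in my code it does not.

I am getting a lot of words like "thou", "she", and "tis" that do not appear on their list, and I am not getting words like "banished" and "church". I am getting "romeo", "juliet", "capulet", and "nurse", so I am at least close, if not really on the right track.

Here is the function that pulls the words and percentages from a text: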

from nltk import FreqDist

def keywords(corpus, threshold=0):
    """Generate a list of possible keywords and their percentage of occurrences.

    corpus (list): text or collection of texts, as a list of word tokens
    threshold (int): min # of occurrences of word in corpus
        (target text uses threshold 3, reference corpus uses 0)
    return percentKW: list of tuples (word, percent), percent as a '%.2f' string
    """
    # get freqDist of corpus as dict. key is word, value = # occurrences
    fdist = FreqDist(corpus)

    # create list of tuples of words meeting threshold & sort w/most common first
    t = [(k, v) for k, v in fdist.items() if v >= threshold]
    t = sorted(t, key=lambda tup: tup[1], reverse=True)

    # calculate number of total tokens
    n = len(corpus)

    # return list of tuples (word, percent word is of total tokens)
    percentKW = [(k, '%.2f' % (100 * (v / n))) for k, v in t]
    return percentKW

Here is the key part of the calling code. targetKW is R&J, and refcorpKWDict is the complete Shakespeare plays.

# results: (word, target %, reference %, difference)
KWList = []
# iterate through text list of tuples
for w, p in targetKW:
    # for each word, store the percent in KWList
    targetPerc = float(p)
    refcorpPerc = float(refcorpKWDict.get(w, 0))
    # if % in text > % in reference corpus
    if (refcorpPerc or refcorpPerc == 0) and (targetPerc > refcorpPerc):
        diff = float('%.2f'%(targetPerc - refcorpPerc))
        # save result to KWList
        KWList.append((w, targetPerc, refcorpPerc, diff))

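Here is what I have tried so far:

Normalized all potential keywords to lower case (helped)
Created a custom short list of keywords (text and comparison text). It seems to work but tells me nothing
Compared R&J to the abridged plays, the plays + sonnets, and the Brown corpus (did not help)
Checked the percentages of potential keywords such as "banished". The percentages were far lower than expected. I am not sure how to interpret that.
Set a minimum length for potential keywords to eliminate words like "ll" and "is" (helped)
Searched for this question. (Did not find anything)

I am using Python 3.5.2 on Windows 10, with IDLE version 3.5.2.

Source: In "Natural Language Processing with Python" (http://www.nltk.org/book/), exercise 4.24 is "Read about 'keyword linkage' (chapter 5 of (Scott & Tribble, 2006)). Extract keywords from NLTK's Shakespeare Corpus and using the NetworkX package, plot keyword linkage networks." I am working through the book on my own for professional development. The referenced 2006 book is "Textual Patterns: Key Words and Corpus Analysis in Language Education" (especially pp. 58-60).

Thank you for your time.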

Answer 1

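Two techniques that could be useful (and may go beyond the book) are weighting words by term frequency-inverse document frequency (usually TF-IDF), and collocations.

TF-IDF is used to identify the important words in a document relative to a larger set of similar documents. It is commonly used as a basis for machine learning tasks such as automatic classification (sentiment analysis, etc.).

TF-IDF essentially looks across your whole corpus of plays and assigns a value to each word according to its importance within each play, weighted by the term's importance across the whole corpus. So ideally you would "fit" a TF-IDF model to the whole corpus of Shakespeare plays (including Romeo and Juliet), and then "transform" Romeo and Juliet into an array of word scores. You would then pick the highest-scoring terms, which are the ones most important to Romeo and Juliet in the context of all of Shakespeare's plays.

I found a few TF-IDF guides helpful...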
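
To make the weighting concrete, here is a minimal sketch of the textbook form (term frequency times log of inverse document frequency) in plain Python. The three tiny "plays" and the `tf_idf` helper are made up for illustration; library implementations usually add smoothing and normalisation, so their numbers will differ.

```python
import math
from collections import Counter

# Toy stand-ins for plays (hypothetical data); real input would be tokenized plays.
docs = {
    "romeo": "the romeo the juliet the nurse".split(),
    "hamlet": "the hamlet the ghost".split(),
    "macbeth": "the macbeth the witches the nurse".split(),
}

def tf_idf(target, docs):
    """Textbook TF-IDF scores for every word in docs[target]."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each word occur?
    df = Counter(word for tokens in docs.values() for word in set(tokens))
    counts = Counter(docs[target])
    total = len(docs[target])
    return {
        word: (count / total) * math.log(n_docs / df[word])
        for word, count in counts.items()
    }

scores = tf_idf("romeo", docs)
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)
```

A word that occurs in every document, like "the" here, gets log(3/3) = 0 however often it appears, which is exactly the weeding-out of common words the question is after.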

https://buhrmann.github.io/tfidf-analysis.html

http://www.ultravioletanalytics.com/2016/11/18/tf-idf-basics-with-pandas-scikit-learn/

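Collocations are available in NLTK and are fairly easy to implement. Collocations look for phrases, i.e. words that frequently appear together. These are often also useful for indicating what a text is "about". http://www.nltk.org/howto/collocations.html

Happy to help if you are interested in either technique.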
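
As a rough illustration of the idea behind collocation scoring, the sketch below ranks adjacent word pairs by pointwise mutual information (PMI) using only the standard library. It is not NLTK's implementation; the `bigram_pmi` helper, the `min_count` cutoff, and the sample sentence are assumptions made for the example.

```python
import math
from collections import Counter

def bigram_pmi(tokens, min_count=2):
    """Score adjacent word pairs by pointwise mutual information (PMI)."""
    n = len(tokens)
    word_counts = Counter(tokens)
    pair_counts = Counter(zip(tokens, tokens[1:]))
    scores = {}
    for (w1, w2), count in pair_counts.items():
        if count < min_count:  # rare pairs give unreliable PMI values
            continue
        # PMI = log2( P(w1, w2) / (P(w1) * P(w2)) ), probabilities estimated over n tokens
        scores[(w1, w2)] = math.log2(count * n / (word_counts[w1] * word_counts[w2]))
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Hypothetical sample text, already lowercased.
tokens = ("good night good night the friar said to the nurse "
          "and the friar said good night").split()
result = bigram_pmi(tokens)
for pair, score in result:
    print(pair, round(score, 2))
```

Pairs whose words mostly occur together ("friar said") score higher than pairs built from words that also appear in other contexts.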

Answer 2

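I have already had to set up TF-IDF for a project I am working on, so here goes. The code below basically does not need any Pandas or Numpy functions, although I strongly recommend Pandas, as it is my go-to tool for managing data. You will need Scikit Learn for the TF-IDF vectorizing. If you do not have it installed, you will need to install it first. It looks like just using pip install scikit-learn[alldeps] should do the trick, but personally I use Anaconda, which has everything pre-installed, so I have not dealt with that side of things. I have broken down the process of finding the important terms in Romeo and Juliet step by step. There are more steps than strictly necessary, in order to explain what each object is, but the full code at the bottom contains only the necessary steps.

Step by step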

from sklearn.feature_extraction.text import TfidfVectorizer

# Two sets of documents
# plays_corpus contains all documents in your corpus *including Romeo and Juliet*
plays_corpus = ['This is Romeo and Juliet','this is another play','and another','and one more']

#romeo is a list that contains *just* the text for Romeo and Juliet
romeo = [plays_corpus[0]] # must be in a list even if only one object

# Initialise your TFIDF Vectorizer object
tfidf_vectorizer = TfidfVectorizer()

# Now create a model by fitting the vectorizer to your main plays corpus. This is essentially an array of TFIDF scores.
model =  tfidf_vectorizer.fit_transform(plays_corpus)

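If you are curious, this is what the array looks like. Each row represents a document in your corpus, and each column is one unique term, in alphabetical order. In this case each row wraps across two lines, and the terms are ['and', 'another', 'is', 'juliet', 'more', 'one', 'play', 'romeo', 'this'].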

tfidf_vectorizer.fit_transform(plays_corpus).toarray()
array([[ 0.33406745,  0.        ,  0.41263976,  0.52338122,  0.        ,
         0.        ,  0.        ,  0.52338122,  0.41263976],
       [ 0.        ,  0.46580855,  0.46580855,  0.        ,  0.        ,
         0.        ,  0.59081908,  0.        ,  0.46580855],
       [ 0.62922751,  0.77722116,  0.        ,  0.        ,  0.        ,
         0.        ,  0.        ,  0.        ,  0.        ],
       [ 0.41137791,  0.        ,  0.        ,  0.        ,  0.64450299,
         0.64450299,  0.        ,  0.        ,  0.        ]])
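
For readers who want to see where these numbers come from, the first row can be reproduced by hand. This is a sketch assuming the vectorizer's default settings (smoothed idf, L2 normalisation) and plain whitespace tokenization, which is sufficient for this toy corpus; the `tfidf_row` helper is made up for the example.

```python
import math
from collections import Counter

plays_corpus = ['This is Romeo and Juliet', 'this is another play',
                'and another', 'and one more']

# Lowercase and split on whitespace; enough for this toy corpus.
docs = [text.lower().split() for text in plays_corpus]
n_docs = len(docs)

# Document frequency: number of documents containing each term.
df = Counter(term for doc in docs for term in set(doc))

def tfidf_row(doc):
    """Smoothed TF-IDF with L2 normalisation for one tokenized document."""
    counts = Counter(doc)
    # Smoothed idf: ln((1 + n_docs) / (1 + df)) + 1
    weights = {t: c * (math.log((1 + n_docs) / (1 + df[t])) + 1)
               for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()}

row = tfidf_row(docs[0])
for term in sorted(row):
    print(term, round(row[term], 8))
```

"and" occurs in three of the four documents, so it ends up with the lowest score in the row, while "romeo" and "juliet" occur in only one and score highest.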

Next we create a list of all the unique terms (this is how I knew what the unique terms above were).

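# get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2
terms = tfidf_vectorizer.get_feature_names_out().tolist()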

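So now we have our main model of TF-IDF scores, which scores every term in every document for its importance within its immediate context (the document) and its wider context (the corpus).

To find the scores of the terms specifically in Romeo and Juliet, we now use our model to .transform that document.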

romeo_scored = tfidf_vectorizer.transform(romeo) # note .transform NOT .fit_transform

This again creates an array, but one with only a single row (as there is only one document).

romeo_scored.toarray()
array([[ 0.33406745,  0.        ,  0.41263976,  0.52338122,  0.        ,
         0.        ,  0.        ,  0.52338122,  0.41263976]])

We can easily convert this array into a list of scores.

# we first view the object as an array, 
# then flatten it as the array is currently like a list in a list.
# Then we transform that array object into a simple list object.
scores = romeo_scored.toarray().flatten().tolist()

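Now we have a list of the terms from our model, and a list of scores for each term that are specific to Romeo and Juliet. Helpfully, these are in the same order, which means we can put them together into a list of tuples.

data = list(zip(terms, scores))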

# Which looks like
[('and', 0.3340674500232949),
 ('another', 0.0),
 ('is', 0.41263976171812644),
 ('juliet', 0.5233812152405496),
 ('more', 0.0),
 ('one', 0.0),
 ('play', 0.0),
 ('romeo', 0.5233812152405496),
 ('this', 0.41263976171812644)]

Now we just need to sort it by score to get the top-ranked items.

# Here we sort the data using 'sorted',
# we choose to provide a sort key,
# our key is lambda x: x[1]
# x refers to each item of the list we're processing (a (term, score) tuple)
# and [1] specifies the second part of the tuple - the score.
# x[0] would sort by the first part - the term.
# reverse = True switches from Ascending to Descending order.

sorted_data = sorted(data, key=lambda x: x[1],reverse=True)

Which finally gives us...

[('juliet', 0.5233812152405496),
 ('romeo', 0.5233812152405496),
 ('is', 0.41263976171812644),
 ('this', 0.41263976171812644),
 ('and', 0.3340674500232949),
 ('another', 0.0),
 ('more', 0.0),
 ('one', 0.0),
 ('play', 0.0)]

You can limit this to the top N terms by slicing the list.

sorted_data[:3]
[('juliet', 0.5233812152405496),
 ('romeo', 0.5233812152405496),
 ('is', 0.41263976171812644)]

Full code

from sklearn.feature_extraction.text import TfidfVectorizer

# Two sets of documents
# plays_corpus contains all documents in your corpus *including Romeo and Juliet*
plays_corpus = ['This is Romeo and Juliet','this is another play','and another','and one more']

#romeo is a list that contains *just* the text for Romeo and Juliet
romeo = [plays_corpus[0]] # must be in a list even if only one object

# Initialise your TFIDF Vectorizer object
tfidf_vectorizer = TfidfVectorizer()

# Now create a model by fitting the vectorizer to your main plays corpus, this creates an array of TFIDF scores
model = tfidf_vectorizer.fit_transform(plays_corpus)

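romeo_scored = tfidf_vectorizer.transform(romeo) # note - .transform NOT .fit_transform

# get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2
terms = tfidf_vectorizer.get_feature_names_out().tolist()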

scores = romeo_scored.toarray().flatten().tolist()

data = list(zip(terms,scores))

sorted_data = sorted(data,key=lambda x: x[1],reverse=True)

sorted_data[:5]

Answer 3

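The problem with your code is that you are too permissive in what you accept as a "key word": any word whose frequency in your text is even a little bit greater than in the reference corpus will be considered a key word. Logically, this should net you about half of the words that have no special status.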

if (refcorpPerc or refcorpPerc == 0) and (targetPerc > refcorpPerc):
    # accept it as a "key word"

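To make the test more selective, choose a larger threshold or use a smarter measure (e.g. an "out-of-rank" measure; google it), and/or rank your candidate keywords and keep only the top of the list, i.e. those with the greatest increase in relative frequency.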
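
The ranking suggestion could be sketched as follows. This is one possible reading rather than a definitive method: it measures the "increase" as the ratio of relative frequencies, and the `rank_keywords` helper, the add-one smoothing, the `min_count` and `top_n` parameters, and the sample texts are all assumptions made for illustration.

```python
from collections import Counter

def rank_keywords(target_tokens, ref_tokens, min_count=3, top_n=10):
    """Rank words by how much more frequent they are in the target text."""
    target_counts = Counter(target_tokens)
    ref_counts = Counter(ref_tokens)
    n_target, n_ref = len(target_tokens), len(ref_tokens)
    ranked = []
    for word, count in target_counts.items():
        if count < min_count:
            continue
        target_rel = count / n_target
        # Add-one smoothing so words absent from the reference do not divide by zero.
        ref_rel = (ref_counts[word] + 1) / (n_ref + 1)
        ranked.append((word, target_rel / ref_rel))
    ranked.sort(key=lambda pair: pair[1], reverse=True)
    return ranked[:top_n]

# Hypothetical toy data; the reference corpus includes the target text.
target = "romeo loves juliet and romeo is banished and romeo weeps and juliet weeps".split()
reference = ("the king and the queen and the fool and hamlet is sad and "
             "macbeth is king and the nurse is old").split() + target
top = rank_keywords(target, reference, min_count=2, top_n=3)
print(top)
```

Because only the top of the ranking is kept, a word like "and" that is merely about as frequent as in the reference corpus no longer makes the list.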
