Concordance for a phrase using NLTK in Python

Question

Is it possible to get a concordance for a phrase in NLTK?

import nltk
from nltk.corpus import PlaintextCorpusReader

corpus_loc = "c://temp//text//"
files = r".*\.txt"
read_corpus = PlaintextCorpusReader(corpus_loc, files)
corpus  = nltk.Text(read_corpus.words())
test = nltk.TextCollection(corpus_loc)

corpus.concordance("claim")

For example, the above returns:

on okay okay okay i can give you the claim number and my information and
 decide on the shop okay okay so the claim number is xxxx - xx - xxxx got

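Now, if I try corpus.concordance("claim number"), it doesn't work. I do have code that does this using just the .partition() method and some further coding, but I would like to know whether the same thing can be done with concordance.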

Answer 1

According to this question, it is not possible to search for multiple words with the concordance() function.
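The limitation described above may depend on the NLTK version. As far as I can tell, more recent releases document the concordance argument as "str or list", where a list of tokens is treated as a phrase. The sketch below assumes such a release is installed; the token list is made up for illustration and is not the asker's corpus.

```python
import nltk

# Illustrative tokens only; in the question these would come from
# read_corpus.words().
tokens = ("okay i can give you the claim number and my information "
          "okay so the claim number is xxxx").split()
text = nltk.Text(tokens)

# Assumption: the installed NLTK accepts a list of tokens as a phrase.
# concordance_list() returns the matches instead of printing them.
hits = text.concordance_list(["claim", "number"], width=60, lines=10)
for hit in hits:
    print(hit.offset, hit.line)
```

On an older release that only accepts a single word, this call will not behave as shown, so a manual approach is still needed there.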

Answer 2

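If you read the discussion under the question that @b3000 dug up, you'll see that, oddly enough, multi-word concordance is actually available, but only in the graphical concordance tool, which you can launch like this: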

>>> from nltk.app import concordance
>>> concordance()

Answer 3

I cobbled this solution together...

import nltk

def n_concordance_tokenised(text, phrase, left_margin=5, right_margin=5):
    # Concordance replication via https://simplypython.wordpress.com/2014/03/14/saving-output-of-nltk-text-concordance/

    phraseList = phrase.split(' ')

    c = nltk.ConcordanceIndex(text.tokens, key=lambda s: s.lower())

    # Find the offsets of each token in the phrase
    offsets = [c.offsets(x) for x in phraseList]
    offsets_norm = []
    # Rebase the offsets of the i-th token to the start of the phrase
    for i in range(len(phraseList)):
        offsets_norm.append([x - i for x in offsets[i]])
    # We have found the offset of a phrase if the rebased values intersect
    # --
    # http://stackoverflow.com/a/3852792/454773
    # the intersection method takes an arbitrary amount of arguments
    # result = set(d[0]).intersection(*d[1:])
    # --
    intersects = set(offsets_norm[0]).intersection(*offsets_norm[1:])

    # Slice out each match with its margins, clamping the left edge at 0
    concordance_txt = [text.tokens[max(offset - left_margin, 0):offset + len(phraseList) + right_margin]
                       for offset in intersects]

    outputs = [''.join([x + ' ' for x in con_sub]) for con_sub in concordance_txt]
    return outputs

def n_concordance(txt, phrase, left_margin=5, right_margin=5):
    # Tokenise the raw string first, then delegate to the tokenised version
    tokens = nltk.word_tokenize(txt)
    text = nltk.Text(tokens)

    return n_concordance_tokenised(text, phrase, left_margin=left_margin, right_margin=right_margin)

# text1 is presumably the Moby Dick text from nltk.book (from nltk.book import text1)
n_concordance_tokenised(text1, 'monstrous size')
>> [u'one was of a most monstrous size . ... This came towards ',
    u'; for Whales of a monstrous size are oftentimes cast up dead ']
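The core idea of this answer (rebase each word's offsets to the start of the phrase, then intersect them) does not depend on NLTK itself. The following is a minimal sketch of that idea in plain Python; the function name phrase_concordance and the sample sentence are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def phrase_concordance(tokens, phrase, left_margin=5, right_margin=5):
    """Return a context window (as a string) for each occurrence of phrase."""
    words = phrase.lower().split()
    if not words:
        return []

    # Map each lowercased token to the positions where it occurs.
    index = defaultdict(list)
    for pos, tok in enumerate(tokens):
        index[tok.lower()].append(pos)

    # Rebase the offsets of the i-th phrase word by -i; a position that
    # survives every intersection is the start of a full phrase match.
    starts = set(index.get(words[0], []))
    for i, word in enumerate(words[1:], start=1):
        starts &= {pos - i for pos in index.get(word, [])}

    windows = []
    for start in sorted(starts):
        lo = max(start - left_margin, 0)  # clamp at the start of the text
        hi = start + len(words) + right_margin
        windows.append(" ".join(tokens[lo:hi]))
    return windows


tokens = "so the claim number is on the claim form not the claim itself".split()
print(phrase_concordance(tokens, "claim number", left_margin=2, right_margin=2))
# → ['so the claim number is on']
```

Sorting the match positions keeps the results in document order, which iterating over a bare set does not guarantee.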

Answer 1
6 accepted 2015-11-23 19:39:24

Answer 2
4 2015-11-23 20:25:36

Answer 3
2 2015-12-13 14:19:17
