
Concordance for a phrase using NLTK in Python

Is it possible to get a concordance for a phrase in NLTK?

import nltk
from nltk.corpus import PlaintextCorpusReader

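corpus_loc = "c://temp//text//"
files = r".*\.txt"  # raw string so the backslash reaches the regex engine unchanged
read_corpus = PlaintextCorpusReader(corpus_loc, files)
corpus = nltk.Text(read_corpus.words())
test = nltk.TextCollection(read_corpus)  # TextCollection expects a corpus reader or a list of texts, not a path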

corpus.concordance("claim")

For example, the above returns:

on okay okay okay i can give you the claim number and my information and
 decide on the shop okay okay so the claim number is xxxx - xx - xxxx got

Now, if I try corpus.concordance("claim number"), it does not work. I do have code that does this using the .partition() method and some further coding, but I'm wondering whether it's possible to do the same using concordance.

According to this issue, it is not possible to search for multiple words with the concordance() function.
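One way to see why the two-word query finds nothing: the lookup runs against an index keyed by individual tokens, so a string containing a space never equals any key. The following is a minimal pure-Python sketch of that idea, not NLTK's actual code; build_index is a hypothetical helper and the tokens are borrowed from the sample output in the question.

```python
from collections import defaultdict

def build_index(tokens):
    """Map each lowercased token to the list of positions where it occurs."""
    index = defaultdict(list)
    for position, token in enumerate(tokens):
        index[token.lower()].append(position)
    return index

# Hypothetical sample, taken from the question's output
tokens = "okay so the claim number is xxxx".split()
index = build_index(tokens)

single_hits = index.get("claim", [])         # a single token is a key in the index
phrase_hits = index.get("claim number", [])  # a two-word string is never a key
print(single_hits, phrase_hits)
```

Any phrase search therefore has to look up each word separately and then check that the positions are adjacent.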

If you read the discussion under the very issue that @b3000 dug up, you'll see that, strangely enough, multi-word concordance is in fact available, but only in the graphical concordance tool, which you can start up like this:

>>> from nltk.app import concordance
>>> concordance()

I munged together this solution:

import nltk

def n_concordance_tokenised(text, phrase, left_margin=5, right_margin=5):
    # Concordance replication via https://simplypython.wordpress.com/2014/03/14/saving-output-of-nltk-text-concordance/

    phrase_list = phrase.split(' ')

    c = nltk.ConcordanceIndex(text.tokens, key=lambda s: s.lower())

    # Find the offsets of each token in the phrase
    offsets = [c.offsets(x) for x in phrase_list]
    # Rebase each token's offsets to the start of the phrase
    offsets_norm = []
    for i in range(len(phrase_list)):
        offsets_norm.append([x - i for x in offsets[i]])
    # The phrase starts wherever the rebased offsets intersect.
    # See http://stackoverflow.com/a/3852792/454773:
    # the intersection method takes an arbitrary number of arguments.
    intersects = set(offsets_norm[0]).intersection(*offsets_norm[1:])

    # Slice out the context window, clamping the left edge at the start of the text
    concordance_txt = [text.tokens[max(offset - left_margin, 0):offset + len(phrase_list) + right_margin]
                       for offset in sorted(intersects)]

    outputs = [''.join([x + ' ' for x in con_sub]) for con_sub in concordance_txt]
    return outputs

def n_concordance(txt, phrase, left_margin=5, right_margin=5):
    # Tokenise a raw string, then delegate to the tokenised version
    tokens = nltk.word_tokenize(txt)
    text = nltk.Text(tokens)

    return n_concordance_tokenised(text, phrase, left_margin=left_margin, right_margin=right_margin)

from nltk.book import text1

n_concordance_tokenised(text1, 'monstrous size')
>> ['one was of a most monstrous size . ... This came towards ',
    '; for Whales of a monstrous size are oftentimes cast up dead ']
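
For comparison, the same kind of result can be reached without an index by sliding a window over the token list and comparing it with the phrase. This is a different and simpler technique than the offset intersection used above, shown here only as a pure-Python sketch that does not need NLTK; phrase_concordance and its parameters are hypothetical names, and the tokens are a made-up sample.

```python
def phrase_concordance(tokens, phrase, left_margin=5, right_margin=5):
    """Return the context around each case-insensitive occurrence of phrase."""
    target = [word.lower() for word in phrase.split()]
    if not target:
        return []  # an empty phrase matches nothing
    n = len(target)
    lowered = [token.lower() for token in tokens]
    results = []
    for start in range(len(tokens) - n + 1):
        # Compare the window of n tokens starting here with the phrase
        if lowered[start:start + n] == target:
            left = max(start - left_margin, 0)  # clamp at the start of the text
            results.append(" ".join(tokens[left:start + n + right_margin]))
    return results

# Made-up sample tokens for illustration
sample = "okay so the claim number is xxxx and the claim was filed".split()
lines = phrase_concordance(sample, "claim number", left_margin=2, right_margin=2)
print(lines)
```

The window scan does work proportional to the text length times the phrase length on every query, so for repeated searches over a large corpus the index-based version is likely the better fit.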
