
How can I get pairs of words from a sentence with NLTK?

I want to take in a sentence:

sentence = "How many people are here?"

and return a list of phrases:

pairs = ["How many", "many people", "people are", "are here"]

I tried

   import nltk

   tokens = nltk.word_tokenize(sentence)
   pairs = nltk.bigrams(tokens)

and instead got <generator object bigrams at 0x103697820>.

I'm pretty new to NLTK, so sorry if this is way off :) Help appreciated!

As you mentioned, the nltk.bigrams() function returns a generator object. Generators need to be iterated over in order to get the values out. This can be done with list(), or by looping over the generator.
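To illustrate this lazy behavior without requiring NLTK, here is a minimal pure-Python sketch; the `bigrams` function below is a hypothetical stand-in for `nltk.bigrams`, not NLTK's actual implementation:

```python
# Hypothetical stand-in for nltk.bigrams: lazily yields consecutive token pairs.
def bigrams(tokens):
    for i in range(len(tokens) - 1):
        yield (tokens[i], tokens[i + 1])

tokens = ["How", "many", "people", "are", "here"]
gen = bigrams(tokens)
print(gen)        # prints the generator's repr; nothing is computed yet
print(list(gen))  # iterating materializes the pairs as tuples
```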

Below, I'm looping/iterating over the generator object (the result of nltk.bigrams()) in a list comprehension, while using " ".join() to combine each pair of words yielded by the generator into a single string, as desired.

tokens = nltk.word_tokenize(sentence)
pairs = [" ".join(pair) for pair in nltk.bigrams(tokens)]

['How many', ...]
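For a self-contained check without NLTK installed, `str.split()` can stand in for `nltk.word_tokenize`, and `zip(tokens, tokens[1:])` produces the same consecutive pairs as `nltk.bigrams` (note that the real `word_tokenize` would additionally split off the trailing `?` as its own token):

```python
sentence = "How many people are here?"
tokens = sentence.split()  # crude stand-in for nltk.word_tokenize
# zip(tokens, tokens[1:]) pairs each token with its successor
pairs = [" ".join(pair) for pair in zip(tokens, tokens[1:])]
print(pairs)  # ['How many', 'many people', 'people are', 'are here?']
```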

This should solve your problem:

import re

# Use raw strings for the Windows path and the regexes to avoid
# invalid escape sequences, and close the file with a context manager.
with open(r'D:\Jupyter notebook\SNPQ.txt', 'r') as f:
    text = f.read()
text = re.sub(r'^\n|\n$', '', text)
for no, line in enumerate(text.splitlines()):
    fields = [i.replace('"', '\\"').strip()
              for i in re.split(r'(?<=^[0-9]{2})([0-9]{13}| {13})|  +', line.strip())
              if i is not None]
    print('"' + '","'.join(fields) + '"')

Thank you :)
