How can I get pairs of words from a sentence with NLTK?
I want to take in a sentence:

sentence = "How many people are here?"

and return a list of phrases:

pairs = ["How many", "many people", "people are", "are here"]
I tried

tokens = nltk.word_tokenize(sentence)
pairs = nltk.bigrams(tokens)

and instead got

<generator object bigrams at 0x103697820>

I'm pretty new to nltk, so sorry if this is off :) Help appreciated!
As you mentioned, the nltk.bigrams() function returns a generator object. Generators need to be iterated through in order to get the values out. This can be done with list(), or by looping over the generator.
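To see why the original attempt printed `<generator object bigrams at 0x...>`, here is a minimal sketch. It uses a hand-rolled stand-in for nltk.bigrams() (an assumption, so the example runs without NLTK installed); NLTK's real function behaves the same way from the caller's point of view:

```python
def bigrams(tokens):
    # Minimal stand-in for nltk.bigrams: lazily yields consecutive pairs.
    for pair in zip(tokens, tokens[1:]):
        yield pair

tokens = ["How", "many", "people", "are", "here"]
gen = bigrams(tokens)
print(gen)        # prints something like <generator object bigrams at 0x...>
print(list(gen))  # [('How', 'many'), ('many', 'people'), ('people', 'are'), ('are', 'here')]
```

The first print shows the generator object itself; only iterating it (here via list()) pulls the pairs out, and a generator can be consumed only once.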
Below, I'm looping/iterating over the generator object (the result of nltk.bigrams()) in a list comprehension, while at the same time using " ".join() to combine each pair (tuple) of words yielded by the generator into a single string, as desired.
import nltk

tokens = nltk.word_tokenize(sentence)
pairs = [" ".join(pair) for pair in nltk.bigrams(tokens)]

['How many', ...]
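The same result can also be produced with a plain for loop over the generator. The sketch below substitutes str.split() for nltk.word_tokenize and zip() for nltk.bigrams (assumptions, so it runs without NLTK data; the tokenization is cruder):

```python
# Whitespace tokenization as a stand-in for nltk.word_tokenize
tokens = "How many people are here".split()

# zip(seq, seq[1:]) yields consecutive pairs, like nltk.bigrams
pairs = []
for pair in zip(tokens, tokens[1:]):
    pairs.append(" ".join(pair))

print(pairs)  # ['How many', 'many people', 'people are', 'are here']
```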
This should solve your problem:

import re

# Raw string so the backslashes in the Windows path are not treated as escapes
with open(r'D:\Jupyter notebook\SNPQ.txt', 'r') as f:
    text = f.read()

# Strip a leading/trailing newline
text = re.sub(r'^\n|\n$', '', text)

for line in text.splitlines():
    fields = [i.replace('"', '\\"').strip()
              for i in re.split(r'(?<=^[0-9]{2})([0-9]{13}| {13})| +', line.strip())
              if i is not None]
    print('"' + '","'.join(fields) + '"')
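To see what that re.split() pattern does without the input file, here is a sketch on a hypothetical line (an assumption about the format: two leading digits, a 13-digit code or 13 spaces, then space-separated fields):

```python
import re

# Hypothetical sample line -- the real file format is only guessed at here
line = '011234567890123alpha beta'

parts = re.split(r'(?<=^[0-9]{2})([0-9]{13}| {13})| +', line.strip())
# re.split includes captured groups; when the split happens on " +",
# the group did not participate, so None appears and must be filtered out
fields = [i.replace('"', '\\"').strip() for i in parts if i is not None]
print('"' + '","'.join(fields) + '"')  # "01","1234567890123","alpha","beta"
```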
Thank you :)