How to generate bigram/trigram corpus only
Is there a way for Gensim to generate strictly the bigrams and trigrams from a list of words?
I can successfully generate the unigrams, bigrams, and trigrams together, but I would like to extract only the bigrams and trigrams.
For example, given the list below:
words = [['the', 'mayor', 'of', 'new', 'york', 'was', 'there'],["i","love","new","york"],["new","york","is","great"]]
I use:
import gensim
bigram = gensim.models.Phrases(words, min_count=1, threshold=1)
bigram_mod = gensim.models.phrases.Phraser(bigram)
words_bigram = [bigram_mod[doc] for doc in words]
This creates a list of unigrams and bigrams as follows:
[['the', 'mayor', 'of', 'new_york', 'was', 'there'],
['i', 'love', 'new_york'],
['new_york', 'is', 'great']]
My question is, is there a way (other than regular expressions) to extract strictly the bigrams, so that in this example only "new_york" would be a result?
It's not a built-in option of the gensim Phrases functionality.
If we can assume none of your original unigrams contained the '_' character, a step that selects only tokens containing '_' shouldn't be too expensive (and doesn't need full regular expressions). For example, your last line could be:
words_bigram = [ [token for token in bigram_mod[doc] if '_' in token] for doc in words ]
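To see the effect of that filter without running gensim, here is a minimal sketch that applies it to the phrased output shown in the question. The list is copied by hand from the question, and the name only_bigrams is just an illustrative choice.

```python
# Phrased output copied from the question (what bigram_mod[doc] produced).
words_bigram = [
    ['the', 'mayor', 'of', 'new_york', 'was', 'there'],
    ['i', 'love', 'new_york'],
    ['new_york', 'is', 'great'],
]

# Keep only tokens that contain the joining character '_'.
# Assumption: no original unigram contained an underscore.
only_bigrams = [[token for token in doc if '_' in token] for doc in words_bigram]

print(only_bigrams)  # [['new_york'], ['new_york'], ['new_york']]
```

Each inner list still corresponds to one input document, so documents without any detected phrase would simply yield an empty list.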
(You could change the joining character if, for some reason, your unigrams contain underscores and you don't want those confused with Phrases-combined bigrams.)
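The following sketch illustrates why the joining character matters. Both token lists are hypothetical, hand-written stand-ins for phrase-model output, and '~' is an arbitrary joining character picked for illustration. In gensim itself the joining character is set through the delimiter argument of Phrases; the type it expects has differed between releases, so check the documentation for your version.

```python
# Hypothetical phrased output: 'snake_case' was already a single token in the
# input, while 'new_york' was joined by the phrase model using '_'.
phrased_default = [['snake_case', 'is', 'popular', 'in', 'new_york']]

# The same document if the phrase model had joined with '~' instead.
phrased_custom = [['snake_case', 'is', 'popular', 'in', 'new~york']]

# Filtering on '_' cannot tell the original unigram from the joined bigram.
ambiguous = [[t for t in doc if '_' in t] for doc in phrased_default]

# Filtering on a character absent from the unigrams returns only the bigram.
clean = [[t for t in doc if '~' in t] for doc in phrased_custom]

print(ambiguous)  # [['snake_case', 'new_york']]
print(clean)      # [['new~york']]
```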
If none of that is good enough, you could potentially look at the code in gensim which actually scores and combines unigrams into bigrams...
https://github.com/RaRe-Technologies/gensim/blob/fbc7d0952f1461fb5de3f6423318ae33d87524e3/gensim/models/phrases.py#L300
...and either extend that module with your extra needed option, or mimic its behavior outside the class in your own code.
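As a rough idea of what mimicking that behavior outside the class could look like, here is a simplified pure-Python sketch. It is not gensim's code: the function extract_bigrams and its details are my own assumptions, the score is only loosely modeled on the count-based formula from the original word2vec phrase paper, and the numbers it produces will not necessarily match what Phrases computes.

```python
from collections import Counter

def extract_bigrams(docs, min_count=1, threshold=1.0, delimiter='_'):
    """Return, per document, only the adjacent pairs scoring above threshold."""
    unigram_counts = Counter(tok for doc in docs for tok in doc)
    bigram_counts = Counter(pair for doc in docs for pair in zip(doc, doc[1:]))
    vocab_size = len(unigram_counts)  # simplification: unigrams only

    def score(a, b):
        # (count(ab) - min_count) / (count(a) * count(b)) * vocab_size
        pair_count = bigram_counts[(a, b)]
        return (pair_count - min_count) / (unigram_counts[a] * unigram_counts[b]) * vocab_size

    result = []
    for doc in docs:
        found = []
        i = 0
        while i < len(doc) - 1:
            a, b = doc[i], doc[i + 1]
            if score(a, b) > threshold:
                found.append(a + delimiter + b)
                i += 2  # both words are consumed by the bigram
            else:
                i += 1
        result.append(found)
    return result

words = [['the', 'mayor', 'of', 'new', 'york', 'was', 'there'],
         ['i', 'love', 'new', 'york'],
         ['new', 'york', 'is', 'great']]

print(extract_bigrams(words))  # [['new_york'], ['new_york'], ['new_york']]
```

Because the unigrams are never emitted in the first place, this approach needs no filtering step and does not depend on the joining character being absent from the input tokens.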