
How to generate bigram/trigram corpus only

Is there a way for Gensim to generate strictly the bigrams and trigrams in a list of words?

I can successfully generate the unigrams, bigrams, and trigrams, but I would like to extract only the bigrams and trigrams.

For example, in the list below:

words = [['the', 'mayor', 'of', 'new', 'york', 'was', 'there'],
         ['i', 'love', 'new', 'york'],
         ['new', 'york', 'is', 'great']]

I use

import gensim

# Learn bigram phrases over the corpus, then freeze the model for fast application.
bigram = gensim.models.Phrases(words, min_count=1, threshold=1)
bigram_mod = gensim.models.phrases.Phraser(bigram)
words_bigram = [bigram_mod[doc] for doc in words]

This creates a list of unigrams and bigrams as follows:

[['the', 'mayor', 'of', 'new_york', 'was', 'there'],
 ['i', 'love', 'new_york'],
 ['new_york', 'is', 'great']]

My question is: is there a way (other than regular expressions) to extract strictly the bigrams, so that in this example only "new_york" would be a result?

It's not a built-in option of the gensim Phrases functionality.

If we can assume none of your original unigrams had the '_' character in them, a step to select only tokens with a '_' shouldn't be too expensive (and doesn't need full regular expressions). For example, your last line could be:

words_bigram = [ [token for token in bigram_mod[doc] if '_' in token] for doc in words ]

(You could change the joining character if for some reason there were underscores in your unigrams and you didn't want those confused with Phrases-combined bigrams.)
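For instance, here is a minimal sketch of that variant, reusing the words list from above. It assumes a gensim 4.x-style Phrases where the delimiter parameter is a plain string (older 3.x releases expected bytes, e.g. b'|'):

# Build the phrase model with a joining character that cannot appear in the unigrams.
bigram = gensim.models.Phrases(words, min_count=1, threshold=1, delimiter='|')
bigram_mod = gensim.models.phrases.Phraser(bigram)

# Keep only tokens containing the custom joining character.
words_bigram = [[token for token in bigram_mod[doc] if '|' in token]
                for doc in words]
# e.g. [['new|york'], ['new|york'], ['new|york']]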

If none of that is good enough, you could potentially look at the code in gensim which actually scores & combines unigrams into bigrams...

https://github.com/RaRe-Technologies/gensim/blob/fbc7d0952f1461fb5de3f6423318ae33d87524e3/gensim/models/phrases.py#L300

...and either extend that module with the extra option you need, or mimic its behavior outside the class in your own code.
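If you'd rather not depend on the joining character at all, a plain-Python sketch (not a gensim built-in; the helper name extract_phrases_only is hypothetical) is to keep only the tokens the phraser emitted that were absent from the original document:

def extract_phrases_only(phraser, docs):
    # Any token produced by the phraser that wasn't in the original document
    # must be a newly combined phrase (assuming no original unigram already
    # looks like a joined phrase token).
    phrased = []
    for doc in docs:
        original = set(doc)
        phrased.append([token for token in phraser[doc] if token not in original])
    return phrased

words_bigram = extract_phrases_only(bigram_mod, words)
# -> [['new_york'], ['new_york'], ['new_york']]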
