
How to generate bigram/trigram corpus only

Is there a way for Gensim to generate strictly the bigrams and trigrams in a list of words?

I can successfully generate the unigrams, bigrams, and trigrams, but I would like to extract only the bigrams and trigrams.

For example, in the list below:

words = [['the', 'mayor', 'of', 'new', 'york', 'was', 'there'],
         ['i', 'love', 'new', 'york'],
         ['new', 'york', 'is', 'great']]

I use

import gensim

# Learn bigram phrases over the corpus, then freeze the model for fast application.
bigram = gensim.models.Phrases(words, min_count=1, threshold=1)
bigram_mod = gensim.models.phrases.Phraser(bigram)
words_bigram = [bigram_mod[doc] for doc in words]

This creates a list of unigrams and bigrams as follows:

[['the', 'mayor', 'of', 'new_york', 'was', 'there'],
 ['i', 'love', 'new_york'],
 ['new_york', 'is', 'great']]

My question is: is there a way (other than regular expressions) to extract strictly the bigrams, so that in this example only "new_york" would be a result?

It's not a built-in option of the gensim Phrases functionality.

If we can assume none of your original unigrams had the '_' character in them, a step to select only tokens with a '_' shouldn't be too expensive (and doesn't need full regular expressions). For example, your last line could be:

words_bigram = [ [token for token in bigram_mod[doc] if '_' in token] for doc in words ]

(You could change the joining character if for some reason there were underscores in your unigrams and you didn't want those confused with Phrases-combined bigrams.)
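For instance, here is a minimal sketch of that variant, reusing the words list from above. It assumes a gensim 4.x-style Phrases where the delimiter parameter is a plain string (older 3.x releases expected bytes, e.g. b'|'):

# Build the phrase model with a joining character that cannot appear in the unigrams.
bigram = gensim.models.Phrases(words, min_count=1, threshold=1, delimiter='|')
bigram_mod = gensim.models.phrases.Phraser(bigram)

# Keep only tokens containing the custom joining character.
words_bigram = [[token for token in bigram_mod[doc] if '|' in token]
                for doc in words]
# e.g. [['new|york'], ['new|york'], ['new|york']]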

If none of that is good enough, you could potentially look at the code in gensim which actually scores & combines unigrams into bigrams...

https://github.com/RaRe-Technologies/gensim/blob/fbc7d0952f1461fb5de3f6423318ae33d87524e3/gensim/models/phrases.py#L300

...and either extend that module with the extra option you need, or mimic its behavior outside the class in your own code.
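If you'd rather not depend on the joining character at all, a plain-Python sketch (not a gensim built-in; the helper name extract_phrases_only is hypothetical) is to keep only the tokens the phraser emitted that were absent from the original document:

def extract_phrases_only(phraser, docs):
    # Any token produced by the phraser that wasn't in the original document
    # must be a newly combined phrase (assuming no original unigram already
    # looks like a joined phrase token).
    phrased = []
    for doc in docs:
        original = set(doc)
        phrased.append([token for token in phraser[doc] if token not in original])
    return phrased

words_bigram = extract_phrases_only(bigram_mod, words)
# -> [['new_york'], ['new_york'], ['new_york']]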
