
How to produce a bag of words depending on relevance across corpus

I understand that TF-IDF (term frequency-inverse document frequency) is the solution here? But the TF in TF-IDF is specific to a single document only. I need to produce a bag of words that is relevant to the WHOLE corpus. Am I doing this wrong, or is there an alternative?

You may be able to do this if you count the IDF on a different corpus. A general corpus containing newswire texts may be suitable. Then you can treat your own corpus as a single document to compute the TF. You will also need a strategy for words that are present in your corpus but not in the external corpus, as they won't have an IDF value. Finally, you can rank the words in your corpus by their TF-IDF scores.
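The approach above can be sketched roughly as follows. This is a minimal illustration, not a definitive implementation: it assumes documents are already tokenized into lists of words, uses smoothed IDF, and handles out-of-vocabulary terms by assigning them the IDF of a term that appears in no external document (one possible strategy among several). The function names are made up for this example.

```python
import math
from collections import Counter

def idf_from_corpus(external_docs):
    """Compute smoothed IDF per term from an external (e.g. newswire) corpus."""
    n_docs = len(external_docs)
    df = Counter()
    for doc in external_docs:
        df.update(set(doc))  # document frequency: count each term once per doc
    # +1 smoothing keeps the ratio finite and dampens rare-term spikes
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in df}
    return idf, n_docs

def rank_corpus_terms(own_corpus, external_docs):
    """Treat the whole corpus as one document; rank its terms by TF-IDF."""
    idf, n_docs = idf_from_corpus(external_docs)
    # OOV strategy: score unseen terms as if they occurred in zero external docs
    oov_idf = math.log((1 + n_docs) / 1) + 1
    tf = Counter(t for doc in own_corpus for t in doc)
    total = sum(tf.values())
    scores = {t: (count / total) * idf.get(t, oov_idf)
              for t, count in tf.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

For example, with an external corpus of `[["stocks", "rise"], ["rain", "today"]]` and your own corpus `[["stocks", "tfidf"], ["tfidf", "rank"]]`, the term "tfidf" ranks highest: it is frequent in your corpus and absent from the external one.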
