
How to produce a bag of words depending on relevance across corpus

I understand that TF-IDF (term frequency-inverse document frequency) is the solution here? But the TF in TF-IDF is specific to a single document only. I need to produce a bag of words that is relevant to the WHOLE corpus. Am I doing this wrong, or is there an alternative?

You may be able to do this if you count the IDF on a different corpus. A general corpus containing newswire texts may be suitable. Then you can treat your own corpus as a single document to compute the TF. You will also need a strategy for words that are present in your corpus but not in the external corpus, as they won't have an IDF value. Finally, you can rank the words in your corpus by their TF-IDF scores.
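The approach above can be sketched roughly as follows. This is a minimal illustration, not a definitive implementation: it assumes documents are already tokenized into lists of words, uses smoothed IDF, and handles out-of-vocabulary terms by assigning them the IDF of a term that appears in no external document (one possible strategy among several). The function names are made up for this example.

```python
import math
from collections import Counter

def idf_from_corpus(external_docs):
    """Compute smoothed IDF per term from an external (e.g. newswire) corpus."""
    n_docs = len(external_docs)
    df = Counter()
    for doc in external_docs:
        df.update(set(doc))  # document frequency: count each term once per doc
    # +1 smoothing keeps the ratio finite and dampens rare-term spikes
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in df}
    return idf, n_docs

def rank_corpus_terms(own_corpus, external_docs):
    """Treat the whole corpus as one document; rank its terms by TF-IDF."""
    idf, n_docs = idf_from_corpus(external_docs)
    # OOV strategy: score unseen terms as if they occurred in zero external docs
    oov_idf = math.log((1 + n_docs) / 1) + 1
    tf = Counter(t for doc in own_corpus for t in doc)
    total = sum(tf.values())
    scores = {t: (count / total) * idf.get(t, oov_idf)
              for t, count in tf.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

For example, with an external corpus of `[["stocks", "rise"], ["rain", "today"]]` and your own corpus `[["stocks", "tfidf"], ["tfidf", "rank"]]`, the term "tfidf" ranks highest: it is frequent in your corpus and absent from the external one.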
