
Bag of Words without images

Original post 2015-06-23 07:31:56 6 1 c++/ opencv/ image-processing/ machine-learning

I am trying to build a bag-of-words class which can create a vocabulary and find the nearest word in that vocabulary for a given vector. For example, I load a float vector or a matrix and want to find the nearest word.

In OpenCV I only found the BOWImgDescriptorExtractor, but it works only with images.

Can someone explain to me how I find the nearest word for a given vector in my vocabulary? I have read a lot about the FlannBasedMatcher and the BruteforceMatcher, but I have no clue how to convert the vector to a format for my vocabulary.

Thank you for your help.

1 answer

You want to convert text documents into vectors, where each feature corresponds to a word (or an n-gram, which is a sequence of n words), and the value of each feature is either the count of the word in the document, its frequency, or, better, its tf-idf.

Once you have the means to convert a document into a vector, then you can measure the distance between any two vectors. These two vectors represent two different documents. In your case, one vector will represent a document containing a single word, and the other will be the text document you're interested in. To keep document length from playing a role in the distance measurement, cosine distance is used a lot in text analysis, rather than Euclidean distance.
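To make the difference between the two distances concrete, here is a minimal C++ sketch. The function names cosineDistance and euclideanDistance are my own, and returning 1.0 for a zero-length vector is just a convention I assumed; none of this comes from OpenCV.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Cosine distance = 1 - cos(angle between a and b).
// Assumed convention: if either vector has zero norm, return 1.0.
double cosineDistance(const std::vector<double>& a, const std::vector<double>& b) {
    assert(a.size() == b.size());
    double dot = 0.0, normA = 0.0, normB = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        dot += a[i] * b[i];
        normA += a[i] * a[i];
        normB += b[i] * b[i];
    }
    if (normA == 0.0 || normB == 0.0) return 1.0;
    return 1.0 - dot / (std::sqrt(normA) * std::sqrt(normB));
}

// Euclidean distance, shown only for comparison.
double euclideanDistance(const std::vector<double>& a, const std::vector<double>& b) {
    assert(a.size() == b.size());
    double sum = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        const double d = a[i] - b[i];
        sum += d * d;
    }
    return std::sqrt(sum);
}
```

For the vectors {1, 2, 0} and {2, 4, 0} (the same "document" repeated twice), the cosine distance is essentially zero while the Euclidean distance is not, which is the length effect described above.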

To find the nearest word to a given vector, you can do a brute-force search by calculating the cosine distance between each word's vector and the query vector. The word that gives you the smallest distance is the winner.
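The brute-force search could be sketched roughly like this in plain C++. I am assuming the vocabulary is stored as one vector per word and that all vectors have the same length; nearestWord and cosineDist are hypothetical names, not library functions.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

using Vec = std::vector<double>;

// 1 - cosine similarity; returns 1.0 for zero-norm input (assumed convention).
double cosineDist(const Vec& a, const Vec& b) {
    assert(a.size() == b.size());
    double dot = 0.0, normA = 0.0, normB = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        dot += a[i] * b[i];
        normA += a[i] * a[i];
        normB += b[i] * b[i];
    }
    if (normA == 0.0 || normB == 0.0) return 1.0;
    return 1.0 - dot / (std::sqrt(normA) * std::sqrt(normB));
}

// Returns the index of the vocabulary vector closest to the query,
// or -1 if the vocabulary is empty. Cost is one distance per word.
int nearestWord(const std::vector<Vec>& vocabulary, const Vec& query) {
    int best = -1;
    double bestDist = std::numeric_limits<double>::infinity();
    for (std::size_t i = 0; i < vocabulary.size(); ++i) {
        const double d = cosineDist(vocabulary[i], query);
        if (d < bestDist) {
            bestDist = d;
            best = static_cast<int>(i);
        }
    }
    return best;
}
```

With a vocabulary of {1,0,0}, {0,1,0}, {0,0,1} and the query {0.1, 0.9, 0.2}, this returns index 1. Ties go to the word that appears first.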

If you need to do this for a lot of vectors with a big vocabulary, there are algorithms that make this search much faster than brute force. They involve building indexes (spatial data structures) that let you check the distance to only a small subset of words to find the winner (you automatically eliminate a whole bunch of words without explicitly measuring their distance). If you're willing to lose a little accuracy to find the nearest word much faster, there are great algorithms for that as well.

To implement a text-document-to-vector converter, you first need to go over the entire corpus and record every unique word, building a hash table that assigns an integer id to each word you see. This is your vocabulary. Let's say there are 50K words. Each of your documents will then be represented by a vector that's 50K long. Each vector will be very sparse: you will have 0 for most features, since most documents contain only a tiny portion of your entire vocabulary. You go over each document, calculate the value (count, frequency, or tf-idf) for each word in it, and record this value in the vector under the column for that word. This is how you convert text into a vector. A word by itself is the simplest vector, of course: a 1 in the corresponding column and zero everywhere else.
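A rough C++ sketch of those two steps, using raw counts as the feature value. The names tokenize, buildVocabulary and toCountVector are made up for illustration, tokenization is plain whitespace splitting (no lowercasing or punctuation handling), and the result is a dense vector for simplicity, although with a 50K vocabulary you would probably want a sparse representation.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>
#include <unordered_map>
#include <vector>

// Hash table mapping each word to its integer id (its column in the vector).
using Vocabulary = std::unordered_map<std::string, std::size_t>;

// Split on whitespace only; real text would need proper tokenization.
std::vector<std::string> tokenize(const std::string& text) {
    std::istringstream in(text);
    std::vector<std::string> tokens;
    std::string word;
    while (in >> word) tokens.push_back(word);
    return tokens;
}

// Go over the whole corpus and give every new word the next free id.
Vocabulary buildVocabulary(const std::vector<std::string>& corpus) {
    Vocabulary vocab;
    for (const auto& doc : corpus)
        for (const auto& word : tokenize(doc))
            vocab.emplace(word, vocab.size());  // no effect if the word is already known
    return vocab;
}

// Convert one document into a count vector of length vocab.size().
// Words that are not in the vocabulary are skipped.
std::vector<double> toCountVector(const Vocabulary& vocab, const std::string& doc) {
    std::vector<double> counts(vocab.size(), 0.0);
    for (const auto& word : tokenize(doc)) {
        const auto it = vocab.find(word);
        if (it == vocab.end()) continue;
        assert(it->second < counts.size());
        counts[it->second] += 1.0;
    }
    return counts;
}
```

Calling toCountVector(vocab, "cat") on a single word gives the one-hot vector mentioned above: a 1 in that word's column and 0 elsewhere.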

As stan0 mentioned, word2vec is open source and already does all this, so I'd give it a try. Here is a tutorial to get you started.