
Convert list of words in Text file to Word Vectors

I have a text file with millions of rows that I want to convert into word vectors, so that later I can compare these vectors against a search keyword and see which texts are closest to it.

My dilemma is that all the training files I have seen for word2vec are in the form of paragraphs, so each word has some contextual meaning within that file. My file, however, is different: it contains an independent keyword on each row.

My question is whether it is possible to create word embeddings from this text file. If not, what is the best approach for finding a matching search keyword among these millions of texts?

**My File Structure:**

Walmart
Home Depot
Home Depot
Sears
Walmart
Sams Club
GreenMile
Walgreen

Expected

Search text: 'WAL'

Result from my file:

WALGREEN
WALMART
WALMART

Embeddings

Let's step back and understand what word2vec is. Word2vec (like GloVe, FastText, etc.) is a way to represent words as vectors. ML models don't understand words, only numbers, so when we deal with words we want to convert them into numbers (vectors). One-hot encoding is one naive way of encoding words as vectors, but for a large vocabulary one-hot vectors become too long, and there is no semantic relationship between one-hot encoded words.
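A minimal sketch of one-hot encoding (the vocabulary and words below are made up for illustration), showing both problems: the vector length equals the vocabulary size, and every pair of distinct words looks equally unrelated:

```python
import numpy as np

def one_hot(word, vocab):
    """Encode a word as a one-hot vector over a fixed vocabulary."""
    vec = np.zeros(len(vocab))
    vec[vocab.index(word)] = 1.0
    return vec

vocab = ["apple", "orange", "cat", "wall"]  # toy vocabulary
apple = one_hot("apple", vocab)
orange = one_hot("orange", vocab)

# Vector length grows with the vocabulary, and the dot product between
# any two different words is 0 -- no semantic relationship is captured.
print(len(apple))        # 4, the vocabulary size
print(apple @ orange)    # 0.0
```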

With deep learning came the distributed representation of words (called word embeddings). One important property of these word embeddings is that the vector distance between related words is small compared to the distance between unrelated words, i.e. distance(apple, orange) < distance(apple, cat).
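This property is usually measured with cosine distance. A quick sketch, where the 3-d vectors are made-up toy values chosen only to illustrate the inequality, not real learned embeddings:

```python
import numpy as np

def cosine_distance(u, v):
    """1 - cosine similarity; smaller means more related."""
    return 1.0 - (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Hypothetical embeddings, for illustration only.
apple  = np.array([0.9, 0.8, 0.1])
orange = np.array([0.8, 0.9, 0.2])
cat    = np.array([0.1, 0.2, 0.9])

print(cosine_distance(apple, orange) < cosine_distance(apple, cat))  # True
```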

So how are these embedding models trained? They are trained on (very) large corpora of text. Given a huge corpus, the model will learn that apple and orange are used (many times) in the same context, and hence that they are related. So to train a good embedding model you need a huge corpus of running text, not independent words, because independent words have no context.
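One way to see why independent words cannot train an embedding: skip-gram-style training consumes (center, context) pairs extracted from token sequences, and a one-token-per-line file produces none. A simplified pair extraction (a sketch of the idea, not any library's actual implementation):

```python
def context_pairs(sentences, window=2):
    """Yield (center, context) training pairs, skip-gram style."""
    for tokens in sentences:
        for i, center in enumerate(tokens):
            for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
                if j != i:
                    yield (center, tokens[j])

# A real sentence yields pairs that tie words to their neighbors:
print(list(context_pairs([["i", "ate", "an", "apple", "and", "an", "orange"]])))

# Each row of the question's file is a single independent token -> no pairs:
rows = [["Walmart"], ["Home Depot"], ["Sears"]]
print(list(context_pairs(rows)))  # []
```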

However, one rarely trains a word embedding model from scratch, because good pretrained embeddings are available as open source. If your text is domain specific (say, medical), you can instead do transfer learning on openly available word embeddings.

Out-of-vocabulary (OOV) words

Word embeddings like word2vec and GloVe cannot return an embedding for OOV words. However, embeddings like FastText (thanks to @gojom for pointing it out) handle OOV words by breaking them into character n-grams and building a vector by summing up the subword vectors that make up the word.
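A rough sketch of the FastText idea (not the library's actual hashing scheme): decompose a word into character n-grams, with `<` and `>` marking word boundaries, and sum the subword vectors. The subword vectors below are random stand-ins for learned embeddings:

```python
import numpy as np

def char_ngrams(word, n_min=3, n_max=4):
    """FastText-style subword n-grams of a word wrapped in < >."""
    w = f"<{word}>"
    return [w[i:i + n] for n in range(n_min, n_max + 1)
            for i in range(len(w) - n + 1)]

# Stand-in "learned" subword vectors (random, for illustration only).
rng = np.random.default_rng(0)
subword_vec = {}

def word_vector(word, dim=8):
    """Sum subword vectors; works even for out-of-vocabulary words."""
    grams = char_ngrams(word)
    for g in grams:
        subword_vec.setdefault(g, rng.normal(size=dim))
    return sum(subword_vec[g] for g in grams)

print(char_ngrams("wal"))  # ['<wa', 'wal', 'al>', '<wal', 'wal>']
vec = word_vector("wal")   # an OOV fragment still gets a vector
```

Because `wal` shares the n-grams `<wa` and `wal` with `walmart`, their summed vectors overlap, which is how FastText gives rough vectors to fragments and misspellings.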

Problem

Coming to your problem:

Case 1: Say the user enters the word WAL. First of all, it is not a valid English word, so it will not be in the vocabulary, and it is hard to assign a meaningful vector to it. Embeddings like FastText handle this by breaking the word into n-grams; this approach also gives good embeddings for misspelled words and slang.

Case 2: Say the user enters the word WALL. If you plan to use vector similarity to find the closest word, the result will never be close to Walmart, because semantically they are unrelated. It will rather be close to words like window, paint, door.

Conclusion

If your search is for semantically similar words, then a solution using vector embeddings will work well. On the other hand, if your search is lexical, vector embeddings will be of no help.

If you wanted to find walmart from a fragment like wal, you'd more likely use something like:

  • a substring or prefix search through all entries; or
  • a reverse index of character n-grams; or
  • some sort of edit distance calculated against all entries, or a subset of likely candidates
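Sketches of all three options against the sample file; the `stores` list is taken from the question, while the function names and the use of `difflib`'s ratio as an edit-distance proxy are illustrative choices:

```python
from difflib import SequenceMatcher

stores = ["Walmart", "Home Depot", "Home Depot", "Sears",
          "Walmart", "Sams Club", "GreenMile", "Walgreen"]

# 1) Case-insensitive substring/prefix scan over all entries.
def substring_search(query, entries):
    q = query.lower()
    return [e for e in entries if q in e.lower()]

# 2) Reverse index of character n-grams: n-gram -> entries containing it.
def build_ngram_index(entries, n=3):
    index = {}
    for e in entries:
        s = e.lower()
        for i in range(len(s) - n + 1):
            index.setdefault(s[i:i + n], set()).add(e)
    return index

# 3) Edit-distance-style ranking (difflib similarity as a stand-in).
def fuzzy_rank(query, entries):
    q = query.lower()
    return sorted(entries,
                  key=lambda e: SequenceMatcher(None, q, e.lower()).ratio(),
                  reverse=True)

print(substring_search("WAL", stores))       # ['Walmart', 'Walmart', 'Walgreen']
index = build_ngram_index(stores)
print(sorted(index.get("wal", set())))       # ['Walgreen', 'Walmart']
```

The substring scan already reproduces the question's expected output; the n-gram index trades memory for speed when the file has millions of rows.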

That is, judging from your example desired output, this is not really a job for word vectors, even though some algorithms, like FastText, can provide rough vectors for word fragments based on their overlap with trained words.

If in fact you want to find similar stores, word vectors might theoretically be useful. But the problem, given your example input, is that such word-vector algorithms require examples of tokens used in context: sequences of tokens that co-appear in natural-language-like relationships. And you want lots of data featuring varied examples in context, to capture subtle gradations of mutual relationships.

While your existing single column of short entity names (stores) can't provide that, you may have something applicable elsewhere if you have richer data sources. Some ideas might be:

  • lists of stores visited by a single customer
  • lists of stores carrying the same product/UPC
  • text from a much larger corpus (such as web-crawled text, or maybe Wikipedia) in which there are sufficient in-context usages of each store name. (You'd just throw out all the other words learned from such training, but the vectors for your tokens of interest might still be of use in your domain.)
