Averaging vectors in Pyspark with lookup table

I'm trying to implement a simple Doc2Vec algorithm in PySpark using a pre-trained GloVe model from https://nlp.stanford.edu/projects/glove/.

I have two RDDs:

  • A pair RDD called documents in the form (K, [V]) where K is the document ID and [V] is a list of all the words in that document, for example ('testDoc1', 'i am using spark'), ('testDoc2', 'testing spark')

  • A pair RDD called words representing the word embeddings in the form (K, V) where K is a word and V is the vector that represents the word, for example ('i', [0.1, 0.1, 0.1]), ('spark', [0.2, 0.2, 0.2]), ('am', [0.3, 0.3, 0.3]), ('testing', [0.5, 0.5, 0.5]), ('using', [0.4, 0.4, 0.4])

What is the correct way to iterate through the words in documents to get an average vector for all of the words? In the above example, the end result would look like: ('testDoc1', [0.25, 0.25, 0.25]), ('testDoc2', [0.35, 0.35, 0.35])
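For reference, the two sample RDDs above could be built like this (a sketch that assumes an existing SparkContext named sc):

documents = sc.parallelize([
    ('testDoc1', 'i am using spark'),
    ('testDoc2', 'testing spark'),
])

words = sc.parallelize([
    ('i',       [0.1, 0.1, 0.1]),
    ('spark',   [0.2, 0.2, 0.2]),
    ('am',      [0.3, 0.3, 0.3]),
    ('testing', [0.5, 0.5, 0.5]),
    ('using',   [0.4, 0.4, 0.4]),
])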

Suppose you have a function tokenize that transforms the strings to a list of words. Then you can flatMap documents to get an RDD of (word, document id) tuples:

# Emit one (word, document id) pair for every token in every document.
flattened_docs = documents.flatMap(lambda x: [(word, x[0]) for word in tokenize(x[1])])
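For the whitespace-separated strings in the example, a minimal tokenize could just split on whitespace (a sketch; real text may need lowercasing and punctuation handling):

def tokenize(text):
    # Split on runs of whitespace; assumes pre-cleaned, space-separated tokens.
    return text.split()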

Then joining with words will give you (word, (document id, vector)) tuples, and you can drop the words at this point:

doc_vectors = flattened_docs.join(words).values()

Note that this is an inner join, so you're throwing away any words that do not have embeddings. Since you presumably want to count those words in your average, a left join is likely more appropriate, and you'll then have to replace any resulting Nones with the zero vector (or whatever vector of your choice).
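A sketch of that left-join variant, assuming 3-dimensional vectors as in the example:

zero_vector = [0.0, 0.0, 0.0]  # fallback for words with no embedding
doc_vectors = (flattened_docs
               .leftOuterJoin(words)  # (word, (document id, vector or None))
               .values()              # (document id, vector or None)
               .mapValues(lambda v: v if v is not None else zero_vector))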

We can group by document id to get an RDD of (document id, [list of vectors]) pairs and then average (I'll assume you have a function called average).

# Collect each document's word vectors and reduce them to their elementwise mean.
final_vectors = doc_vectors.groupByKey().mapValues(average)
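One possible average, assuming all vectors have the same dimension:

def average(vectors):
    # groupByKey yields an iterable, so materialize it to get its length.
    vectors = list(vectors)
    return [sum(component) / len(vectors) for component in zip(*vectors)]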

(Please excuse my Scala-influenced Python. It's been a while since I've used pyspark and I haven't checked if it's flatMap or flat_map and so on.)
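On the sample data this should reproduce the expected result (ordering and float rounding may vary):

print(final_vectors.collect())
# [('testDoc1', [0.25, 0.25, 0.25]), ('testDoc2', [0.35, 0.35, 0.35])]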
