Averaging vectors in PySpark with lookup table
I'm trying to implement a simple Doc2Vec algorithm in PySpark using a pre-trained GloVe model from https://nlp.stanford.edu/projects/glove/.
I have two RDDs:

A pair RDD called documents in the form (K:[V]), where K is the document ID and [V] is a list of all the words in that document, for example ('testDoc1':'i am using spark') ('testDoc2':'testing spark')
A pair RDD called words representing the word embeddings in the form K:V, where K is a word and V is the vector that represents the word, for example ('i', [0.1, 0.1, 0.1]) ('spark': [0.2, 0.2, 0.2]) ('am', [0.3, 0.3, 0.3]) ('testing', [0.5, 0.5, 0.5]) ('using', [0.4, 0.4, 0.4])
What is the correct way to iterate through the words in documents to get an average vector over all of the words? In the above example, the end result would look like: ('testDoc1':[0.25, 0.25, 0.25]) ('testDoc2':[0.35, 0.35, 0.35])
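For concreteness, the sample data above could be written out as plain Python pairs like the following (a sketch; the `sc.parallelize` lines assume a live SparkContext named `sc`, which is not shown in the question):

```python
# The two datasets from the question, as lists of key-value pairs.
documents = [('testDoc1', 'i am using spark'),
             ('testDoc2', 'testing spark')]

words = [('i',       [0.1, 0.1, 0.1]),
         ('spark',   [0.2, 0.2, 0.2]),
         ('am',      [0.3, 0.3, 0.3]),
         ('testing', [0.5, 0.5, 0.5]),
         ('using',   [0.4, 0.4, 0.4])]

# With a SparkContext `sc`, these would become pair RDDs:
# documents_rdd = sc.parallelize(documents)
# words_rdd = sc.parallelize(words)
```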
Suppose you have a function tokenize that transforms the strings to a list of words. Then you can flatMap documents to get an RDD of tuples (word, document id):
flattened_docs = documents.flatMap(lambda x: [(word, x[0]) for word in tokenize(x[1])])
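The tokenize function is only assumed to exist; a minimal sketch is plain whitespace splitting, which suffices for the sample data here (real text would also need lowercasing and punctuation handling):

```python
def tokenize(text):
    """Split a document string into a list of word tokens by whitespace."""
    return text.split()
```

With this, the flatMap above turns ('testDoc1', 'i am using spark') into the pairs ('i', 'testDoc1'), ('am', 'testDoc1'), ('using', 'testDoc1'), ('spark', 'testDoc1').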
Then joining with words will give you (word, (document id, vector)) tuples, and you can drop the words at this point:
doc_vectors = flattened_docs.join(words).values()
Note that this is an inner join, so you're throwing away any words that do not have embeddings. Since you presumably want to count those words in your average, a left join is likely more appropriate, and you'll then have to replace any resulting Nones with the zero vector (or whatever vector of your choice).
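The None-filling step after a left outer join might be sketched like this; the helper and the dimension default of 3 (matching the sample vectors) are assumptions, not part of the original answer:

```python
def fill_missing(vector, dim=3):
    """Replace a missing embedding (None from leftOuterJoin) with a zero vector."""
    return vector if vector is not None else [0.0] * dim

# In PySpark this would be applied after the left join, roughly:
# doc_vectors = (flattened_docs
#                .leftOuterJoin(words)   # (word, (doc id, vector or None))
#                .values()               # (doc id, vector or None)
#                .mapValues(fill_missing))
```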
We can group by document id to get an RDD of (document id, [list of vectors]) and then average (I'll assume you have a function called average):
final_vectors = doc_vectors.groupByKey().mapValues(average)
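The average function is assumed in the answer; one possible element-wise implementation is sketched below. Note that groupByKey yields an iterable (a ResultIterable in PySpark), so it is materialized into a list first:

```python
def average(vectors):
    """Element-wise mean of an iterable of equal-length vectors."""
    vectors = list(vectors)  # groupByKey yields an iterable, not a list
    n = len(vectors)
    return [sum(components) / n for components in zip(*vectors)]
```

Applied to the sample data, 'testDoc1' averages the vectors for i, am, using, and spark, giving [0.25, 0.25, 0.25], and 'testDoc2' averages testing and spark, giving [0.35, 0.35, 0.35].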
(Please excuse my Scala-influenced Python. It's been a while since I've used PySpark and I haven't checked if it's flatMap or flat_map and so on.)