
Initializing gensim objects on all spark worker nodes

I have a function and a UDF which I have created:

    import gensim
    import numpy as np
    from gensim import corpora
    from gensim.test.utils import get_tmpfile
    from nltk.tokenize import word_tokenize
    from pyspark import SparkFiles
    from pyspark.sql.functions import udf
    from pyspark.sql.types import StringType

    def test(string):
        # Every call re-reads the files shipped via sc.addFile() and
        # rebuilds the dictionary, TF-IDF model and similarity index.
        path_index = SparkFiles.get("corpus_final_production.index")
        path_dictionary = SparkFiles.get("dictionary_production.gensim")
        path_corpus = SparkFiles.get("corpus_final_production")
        dictionary = corpora.Dictionary.load(path_dictionary)
        corpus = corpora.MmCorpus(path_corpus)
        tf_idf = gensim.models.TfidfModel(corpus)
        index_tmpfile = get_tmpfile(path_index)
        sims = gensim.similarities.Similarity(index_tmpfile, tf_idf[corpus], num_features=len(dictionary))

        # Query the index with the TF-IDF vector of the input string.
        query_doc = word_tokenize(string.lower())
        query_doc_bow = dictionary.doc2bow(query_doc)
        query_doc_tf_idf = tf_idf[query_doc_bow]
        scores = sims[query_doc_tf_idf]  # run the similarity query once
        sum_of_sims = np.sum(scores, dtype=np.float32)
        max_sims = np.amax(scores)
        max_count = np.count_nonzero(scores >= max_sims - 0.05)
        max_sims_origin = file_docs[np.argmax(scores)]  # file_docs: list of source docs, defined elsewhere
        return max_sims_origin

    test_udf = udf(lambda x: test(x), StringType())
    df_new = garuda.withColumn('max_sim_origin', test_udf(garuda.text))

It is working fine, but as you can see I am applying a row-wise action to the PySpark dataframe. For every row, the dictionary, corpus, and sims index are regenerated from the files, which takes close to 6 minutes per row.

Is there a way for me to initialize the dictionary, corpus, and index files on every worker node beforehand, instead of building them inside the UDF?

I am new to Spark, so any help is appreciated.

I have already added all the dictionary and corpus files, which are pregenerated, using sc.addFile().
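For context, a minimal sketch of how those files might have been shipped from the driver with sc.addFile() (the local paths here are hypothetical; only the basenames must match what SparkFiles.get() looks up in the UDF above):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    sc = spark.sparkContext

    # Hypothetical local paths; only the file names need to match
    # the names passed to SparkFiles.get() in the UDF.
    sc.addFile("/data/corpus_final_production.index")
    sc.addFile("/data/dictionary_production.gensim")
    sc.addFile("/data/corpus_final_production")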

You may try the following:

  1. Creating your gensim components once
  2. Broadcasting the reusable gensim components to every worker
  3. Modifying your UDF to use the broadcast components
  4. Applying your modified UDF test
    # Step 1: build the gensim components once, on the driver.
    path_index = SparkFiles.get("corpus_final_production.index")
    path_dictionary = SparkFiles.get("dictionary_production.gensim")
    path_corpus = SparkFiles.get("corpus_final_production")
    dictionary = corpora.Dictionary.load(path_dictionary)
    corpus = corpora.MmCorpus(path_corpus)
    tf_idf = gensim.models.TfidfModel(corpus)
    index_tmpfile = get_tmpfile(path_index)
    sims = gensim.similarities.Similarity(index_tmpfile, tf_idf[corpus], num_features=len(dictionary))

    # Step 2: broadcast the reusable components to every worker.
    # sparkSession is your active SparkSession.
    dictionaryBC = sparkSession.sparkContext.broadcast(dictionary)
    tfidfBC = sparkSession.sparkContext.broadcast(tf_idf)
    simsBC = sparkSession.sparkContext.broadcast(sims)
    # file_docs (assumed to be the pregenerated list of source documents
    # that the UDF indexes into) must also reach the workers, so
    # broadcast it as well.
    file_docsBC = sparkSession.sparkContext.broadcast(file_docs)

    # Step 3: the UDF now only reads the broadcast values.
    def test(string):
        dictionary = dictionaryBC.value
        tf_idf = tfidfBC.value
        sims = simsBC.value
        file_docs = file_docsBC.value

        query_doc = word_tokenize(string.lower())
        query_doc_bow = dictionary.doc2bow(query_doc)
        query_doc_tf_idf = tf_idf[query_doc_bow]
        scores = sims[query_doc_tf_idf]  # run the similarity query once
        sum_of_sims = np.sum(scores, dtype=np.float32)
        max_sims = np.amax(scores)
        max_count = np.count_nonzero(scores >= max_sims - 0.05)
        max_sims_origin = file_docs[np.argmax(scores)]
        return max_sims_origin

    test_udf = udf(lambda x: test(x), StringType())

    # Step 4: apply the modified UDF.
    df_new = garuda.withColumn('max_sim_origin', test_udf(garuda.text))
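This moves the one-time cost of loading the corpus and building the index onto the driver, and each executor receives a read-only copy of the broadcast values, so per row the UDF only pays for tokenization and the similarity query. Note that withColumn is lazy and nothing executes until an action runs; a quick sanity check might look like:

    df_new.select('text', 'max_sim_origin').show(5, truncate=False)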

Let me know if this works for you.
