Initializing gensim objects on all spark worker nodes
I have a function and a UDF which I have created:
from pyspark import SparkFiles
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType
from gensim import corpora
from gensim.test.utils import get_tmpfile
from nltk.tokenize import word_tokenize
import gensim
import numpy as np

def test(string):
    # Load the pregenerated files shipped to the workers via sc.addFile()
    path_index = SparkFiles.get("corpus_final_production.index")
    path_dictionary = SparkFiles.get("dictionary_production.gensim")
    path_corpus = SparkFiles.get("corpus_final_production")
    # Rebuild the dictionary, corpus, TF-IDF model and similarity index
    dictionary = corpora.Dictionary.load(path_dictionary)
    corpus = corpora.MmCorpus(path_corpus)
    tf_idf = gensim.models.TfidfModel(corpus)
    index_tmpfile = get_tmpfile(path_index)
    sims = gensim.similarities.Similarity(index_tmpfile, tf_idf[corpus], num_features=len(dictionary))
    # Score the query string against the whole corpus
    query_doc = word_tokenize(string.lower())
    query_doc_bow = dictionary.doc2bow(query_doc)
    query_doc_tf_idf = tf_idf[query_doc_bow]
    sum_of_sims = np.sum(sims[query_doc_tf_idf], dtype=np.float32)
    max_sims = np.amax(sims[query_doc_tf_idf])
    max_count = np.count_nonzero(sims[query_doc_tf_idf] >= max_sims - 0.05)
    max_sims_origin = file_docs[np.argmax(sims[query_doc_tf_idf])]  # file_docs: list of source docs, defined elsewhere
    return max_sims_origin

test_udf = udf(lambda x: test(x), StringType())
df_new = garuda.withColumn('max_sim_origin', test_udf(garuda.text))
It works fine, but as you can see I am applying a row-wise action to the PySpark DataFrame: for every row, the dictionary, corpus, and similarity index are rebuilt, which takes close to 6 minutes per row.
Is there a way to initialize the dictionary, corpus, and index files on every worker node beforehand, instead of building them inside the UDF? I am new to Spark, so any help is appreciated.
I have already added all the dictionary and corpus files with sc.addFile(), as they are pregenerated.
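For reference, the driver-side setup this implies would look roughly like the sketch below. The file locations are placeholders (the question does not show them), and spark/sc stand for whatever session and context the job already has.

from pyspark.sql import SparkSession

# Hypothetical setup; the file locations below are placeholders
spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

# Ship the pregenerated gensim artifacts to every worker node
sc.addFile("corpus_final_production.index")
sc.addFile("dictionary_production.gensim")
sc.addFile("corpus_final_production")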
You may try the following:
from pyspark import SparkFiles
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType
from gensim import corpora
from gensim.test.utils import get_tmpfile
from nltk.tokenize import word_tokenize
import gensim
import numpy as np

# Step 1: build the expensive gensim objects once, on the driver
path_index = SparkFiles.get("corpus_final_production.index")
path_dictionary = SparkFiles.get("dictionary_production.gensim")
path_corpus = SparkFiles.get("corpus_final_production")
dictionary = corpora.Dictionary.load(path_dictionary)
corpus = corpora.MmCorpus(path_corpus)
tf_idf = gensim.models.TfidfModel(corpus)
index_tmpfile = get_tmpfile(path_index)
sims = gensim.similarities.Similarity(index_tmpfile, tf_idf[corpus], num_features=len(dictionary))

# Step 2: broadcast them so each executor gets a single read-only copy
dictionaryBC = sparkSession.sparkContext.broadcast(dictionary)
tfidfBC = sparkSession.sparkContext.broadcast(tf_idf)
simsBC = sparkSession.sparkContext.broadcast(sims)

# Step 3: the UDF reads the broadcast values instead of rebuilding them per row
def test(string):
    dictionary = dictionaryBC.value
    tf_idf = tfidfBC.value
    sims = simsBC.value
    query_doc = word_tokenize(string.lower())
    query_doc_bow = dictionary.doc2bow(query_doc)
    query_doc_tf_idf = tf_idf[query_doc_bow]
    sum_of_sims = np.sum(sims[query_doc_tf_idf], dtype=np.float32)
    max_sims = np.amax(sims[query_doc_tf_idf])
    max_count = np.count_nonzero(sims[query_doc_tf_idf] >= max_sims - 0.05)
    max_sims_origin = file_docs[np.argmax(sims[query_doc_tf_idf])]  # file_docs must also reach the workers
    return max_sims_origin

test_udf = udf(lambda x: test(x), StringType())

# Step 4: apply the UDF; the heavy objects are no longer rebuilt per row
df_new = garuda.withColumn('max_sim_origin', test_udf(garuda.text))
Let me know if this works for you.
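For context on why this helps: Step 1 now runs exactly once on the driver, and sparkContext.broadcast() ships each object to every executor a single time, caching it there for reuse, so the UDF merely dereferences broadcastVar.value per row instead of reloading the files and rebuilding the TF-IDF model and similarity index. One thing to double-check: file_docs is referenced inside the UDF but never broadcast, so it would need to be shipped the same way (or be otherwise available on the workers) for the lookup on the last line of the UDF to work.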