Creating a sentence-transformer model in Spark MLlib
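I am using a pretrained model from the sentence-transformers library to check the similarity between two sentences. Now I need to implement this particular model using Spark MLlib. Any suggestions? I would really appreciate any help you can provide.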
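One approach I found that works is to use a Pandas UDF to encode the text and return the embeddings. The resulting embedding column can then be used with MLlib.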
import pandas as pd
import pyspark.sql.functions as F
from pyspark.sql import SparkSession
from pyspark.sql.types import ArrayType, DoubleType, StringType
from sentence_transformers import SentenceTransformer

# reuse the active Spark session, or create one when running outside a shell/notebook
spark = SparkSession.builder.getOrCreate()

# load the pretrained SBERT model
model = SentenceTransformer("all-MiniLM-L6-v2")

# sentences to encode
sentences = [
    "This framework generates embeddings for each input sentence",
    "Sentences are passed as a list of string.",
    "The quick brown fox jumps over the lazy dog.",
]

# create a Spark DataFrame; with a StringType schema the single column is named "value"
data = spark.createDataFrame(sentences, StringType())
data.show()

# create a pandas UDF that encodes the text and returns an array of doubles
@F.pandas_udf(returnType=ArrayType(DoubleType()))
def encode(x: pd.Series) -> pd.Series:
    # encode() returns a 2-D NumPy array; convert it to one Python list per row
    return pd.Series(model.encode(x.tolist()).tolist())

# apply the UDF and show the result
data.withColumn("embedding", encode("value")).show()
Output:
+--------------------+
| value|
+--------------------+
|This framework ge...|
|Sentences are pas...|
|The quick brown f...|
+--------------------+
+--------------------+--------------------+
| value| embedding|
+--------------------+--------------------+
|This framework ge...|[-0.0137173617258...|
|Sentences are pas...|[0.05645250156521...|
|The quick brown f...|[0.04393352568149...|
+--------------------+--------------------+
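Since the original goal was to check the similarity between two sentences, here is a minimal sketch of one way to compare two embedding columns once they exist. It is not part of the original answer: `cosine_similarity` and `add_similarity_column` are hypothetical helper names, and the sketch uses a plain Spark SQL UDF rather than an MLlib transformer. Returning 0.0 for an all-zero vector is a convention chosen here, not something the source specifies.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Assumption: treat an all-zero embedding as having no similarity.
        return 0.0
    return dot / (norm_a * norm_b)


def add_similarity_column(df, left_col, right_col, out_col="similarity"):
    """Hypothetical helper: add a cosine-similarity column to a Spark DataFrame.

    Both input columns are assumed to be array<double> embeddings, such as the
    "embedding" column produced by the pandas UDF in the answer.
    """
    # Imported lazily so cosine_similarity stays usable without Spark installed.
    import pyspark.sql.functions as F
    from pyspark.sql.types import DoubleType

    similarity_udf = F.udf(cosine_similarity, DoubleType())
    return df.withColumn(out_col, similarity_udf(F.col(left_col), F.col(right_col)))
```

Given a DataFrame `pairs` (a hypothetical name) that holds two embedding columns, say `embedding_a` and `embedding_b`, calling `add_similarity_column(pairs, "embedding_a", "embedding_b")` would add a `similarity` column. If an MLlib estimator or transformer needs the embeddings as vectors instead of arrays, `pyspark.ml.functions.array_to_vector` (available from Spark 3.1) can convert the column first.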