ValueError：当提供 input_signature 时，Python function 的所有输入必须可转换为张量：

Question

我有这样的 pyspark df：

id        desc
1      abd hdbh jbj
2      sgjhd jhdgh gjhg
3      bvj hvhgvgh
4      jkjb bhj

现在我想将我的desc列转换为向量，所以我使用 Google 句子编码器作为 udf，这是我的代码：

module_url = "https://tfhub.dev/google/universal-sentence-encoder/4"
model = hub.load(module_url)

def embedding(input):
    return (model[input])

df.withColumn("Embedding", list(embedding(f.lit("desc"))))

这是错误日志：

ValueError                                Traceback (most recent call last)
/tmp/ipykernel_13810/1173342766.py in <module>
----> 1 df_shirt_sample.withColumn("Embedding", list(embedding(f.lit("desc"))))

/tmp/ipykernel_13810/446837446.py in embedding(input)
      1 def embedding(input):
----> 2     return (model(input))

~/miniconda3/envs/dev_env_37/lib/python3.7/site-packages/tensorflow/python/saved_model/load.py in _call_attribute(instance, *args, **kwargs)
    684 
    685 def _call_attribute(instance, *args, **kwargs):
--> 686   return instance.__call__(*args, **kwargs)
    687 
    688 

~/miniconda3/envs/dev_env_37/lib/python3.7/site-packages/tensorflow/python/util/traceback_utils.py in error_handler(*args, **kwargs)
    151     except Exception as e:
    152       filtered_tb = _process_traceback_frames(e.__traceback__)
--> 153       raise e.with_traceback(filtered_tb) from None
    154     finally:
    155       del filtered_tb

~/miniconda3/envs/dev_env_37/lib/python3.7/site-packages/tensorflow/python/eager/function_spec.py in _convert_inputs_to_signature(inputs, input_signature, flat_input_signature)
    521         need_packing = True
    522       except ValueError:
--> 523         raise ValueError("When input_signature is provided, all inputs to "
    524                          "the Python function must be convertible to "
    525                          "tensors:\n"

ValueError: When input_signature is provided, all inputs to the Python function must be convertible to tensors:
  inputs: (
    Column<b'desc'>)
  input_signature: (
    TensorSpec(shape=<unknown>, dtype=tf.string, name=None)).

有人可以告诉我我做错了什么

Answer 1

在这篇文章之后，我尝试从 UDF 加载 model： Spark 广播训练有素的 tensorflow SavedModel 。 这将在每个工作人员上加载 model，然后您可以进行预测。

例子：

import tensorflow_hub as hub
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, FloatType


def embedding(x):
    model_url = "https://tfhub.dev/google/universal-sentence-encoder/4"
    model = hub.load(model_url, tags=["serve"])
    return model([x])


@F.udf(returnType=ArrayType(ArrayType(FloatType())))
def infer(data):
    outputs = embedding(data)
    return outputs.numpy().tolist()


spark = SparkSession.builder.getOrCreate()
data = [{"text": "abd hdbh jbj"}]
df = spark.createDataFrame(data=data)
df = df.withColumn("embedding", infer("text"))
df.show(10)
df.printSchema()

这使：

+------------+--------------------+                                             
|        text|           embedding|
+------------+--------------------+
|abd hdbh jbj|[[0.054931916, -0...|
+------------+--------------------+

root
 |-- text: string (nullable = true)
 |-- embedding: array (nullable = true)
 |    |-- element: array (containsNull = true)
 |    |    |-- element: float (containsNull = true)

查看使用 PySpark 在 Spark 上部署 TF 2.0 SavedModel 的指南： https://github.com/tensorflow/tensorflow/issues/31421

Another way to do this is to use Petastorm to convert the Spark Dataframe to TensorFlow Dataframe and then feed it to a distributed model, please see: https://www.databricks.com/notebooks/simple-aws/petastorm-spark-converter -tensorflow.html

ValueError：当提供 input_signature 时，Python function 的所有输入必须可转换为张量：

问题描述

1 个解决方案

解决方案1
0 2022-09-02 11:26:49

ValueError：当提供 input_signature 时，Python function 的所有输入必须可转换为张量：

问题描述

1 个解决方案

解决方案1 0 2022-09-02 11:26:49

解决方案1
0 2022-09-02 11:26:49