
Pyspark Streaming with Pandas UDF

I am new to Spark Streaming and Pandas UDFs. I am working on a pyspark consumer from Kafka; the payload is in XML format, and I am trying to parse the incoming XML by applying a pandas UDF:

import pandas as pd
from pyspark.sql.functions import pandas_udf, PandasUDFType

@pandas_udf("col1 string, col2 string", PandasUDFType.GROUPED_MAP)
def test_udf(df):
    import xmltodict
    from collections.abc import MutableMapping  # moved to collections.abc on Python 3.10+
    xml_str = df.iloc[0, 0]
    df_col = ['col1', 'col2']
    doc = xmltodict.parse(xml_str, dict_constructor=dict)
    extract_needed_fields = {k: doc[k] for k in df_col}
    return pd.DataFrame([{'col1': 'abc', 'col2': 'def'}], index=[0], dtype="string")

data = df.selectExpr("CAST(value AS STRING) AS value")
data.groupby("value").apply(test_udf).writeStream.format("console").start()

I get the below error:

  File "pyarrow/array.pxi", line 859, in pyarrow.lib.Array.from_pandas
  File "pyarrow/array.pxi", line 215, in pyarrow.lib.array
  File "pyarrow/array.pxi", line 104, in pyarrow.lib._handle_arrow_array_protocol
ValueError: Cannot specify a mask or a size when passing an object that is converted with the __arrow_array__ protocol.


Is this the right approach? What am I doing wrong?

While converting a pandas dataframe to a pyspark one, I stumbled upon this error as well:

Cannot specify a mask or a size when passing an object that is converted with the __arrow_array__ protocol

My pandas dataframe had datetime-like values that I tried to convert to "string". I initially used the astype("string") method, which looked like this:

df["time"] = (df["datetime"].dt.time).astype("string")

When I tried to get the info of this dataframe, it seemed like it had indeed been converted to a string type:

df.info(verbose=True)
> ...
>  #   Column    Non-Null Count   Dtype
> ...
>  6   time      295452 non-null  string

But the error kept coming back.

Solution

To avoid it, I went on to use the apply(str) method instead:

df["time"] = (df["datetime"].dt.time).apply(str)

This gave me a dtype of object:

df.info(verbose=True)
> ...
>  #   Column    Non-Null Count   Dtype
> ...
>  6   time      295452 non-null  object

After that, the conversion was successful:

spark.createDataFrame(df)
# DataFrame[datetime: string, date: string, year: bigint, month: bigint, day: bigint, day_name: string, time: string, hour: bigint, minute: bigint]

It looks like this is more of an undocumented limitation than a bug. You cannot use any pandas dtype that is backed by an extension array implementing the __arrow_array__ method, because pyspark always passes a mask when converting to Arrow. The string dtype you used is stored in a StringArray, which is exactly such a case. After I converted the string dtype to object, the error went away.
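Applied to the UDF in the question, a minimal sketch of the same workaround (untested, and assuming the parsed XML has top-level col1/col2 elements) would be to return plain object-dtype columns instead of the pandas "string" extension dtype:

# Sketch: drop dtype="string" so the returned columns stay as ordinary
# Python objects (object dtype), which PyArrow converts without the
# __arrow_array__ mask conflict described above.
@pandas_udf("col1 string, col2 string", PandasUDFType.GROUPED_MAP)
def test_udf(df):
    import xmltodict
    xml_str = df.iloc[0, 0]
    doc = xmltodict.parse(xml_str, dict_constructor=dict)
    # doc.get('col1') / doc.get('col2') are assumptions about the payload
    # structure; adjust the lookups to match your actual XML.
    return pd.DataFrame([{'col1': str(doc.get('col1')),
                          'col2': str(doc.get('col2'))}])

Equivalently, keeping the original return statement but calling .astype(object) on the string columns before returning should have the same effect.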

