Apache Spark -- Assign the result of a UDF to multiple dataframe columns

Question

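I am using pyspark, loading a large csv file into a dataframe with spark-csv, and as a pre-processing step I need to apply a variety of operations to the data available in one of the columns (which contains a json string). That will return X values, each of which needs to be stored in its own separate column.

That functionality will be implemented in a UDF. However, I am not sure how to return a list of values from that UDF and feed them into individual columns. Below is a simple example: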

(...)
from pyspark.sql.functions import udf
def udf_test(n):
    return [n/2, n%2]

test_udf=udf(udf_test)


df.select('amount','trans_date').withColumn("test", test_udf("amount")).show(5)

This produces the following:

+------+----------+--------------------+
|amount|trans_date|                test|
+------+----------+--------------------+
|  28.0|2016-02-07|         [14.0, 0.0]|
| 31.01|2016-02-07|[15.5050001144409...|
| 13.41|2016-02-04|[6.70499992370605...|
| 307.7|2015-02-17|[153.850006103515...|
| 22.09|2016-02-05|[11.0450000762939...|
+------+----------+--------------------+
only showing top 5 rows

What would be the best way to store the two (in this example) values returned by the udf in separate columns? Right now they are typed as strings:

df.select('amount','trans_date').withColumn("test", test_udf("amount")).printSchema()

root
 |-- amount: float (nullable = true)
 |-- trans_date: string (nullable = true)
 |-- test: string (nullable = true)

Answer 1

It is not possible to create multiple top-level columns from a single UDF call, but you can create a new struct. This requires a UDF with a specified returnType:

from pyspark.sql.functions import udf
from pyspark.sql.types import StructType, StructField, FloatType

schema = StructType([
    StructField("foo", FloatType(), False),
    StructField("bar", FloatType(), False)
])

def udf_test(n):
    return (n / 2, n % 2) if n and n != 0.0 else (float('nan'), float('nan'))

test_udf = udf(udf_test, schema)
df = sc.parallelize([(1, 2.0), (2, 3.0)]).toDF(["x", "y"])

foobars = df.select(test_udf("y").alias("foobar"))
foobars.printSchema()
## root
##  |-- foobar: struct (nullable = true)
##  |    |-- foo: float (nullable = false)
##  |    |-- bar: float (nullable = false)

You can flatten the schema further with a simple select:

foobars.select("foobar.foo", "foobar.bar").show()
## +---+---+
## |foo|bar|
## +---+---+
## |1.0|0.0|
## |1.5|1.0|
## +---+---+
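
For the JSON column described in the question, the same struct pattern could look like the sketch below. The field names `amount` and `currency`, the column name `payload`, and the helper names `parse_payload` and `add_parsed_columns` are assumptions made for illustration; none of them come from the original post. The Spark part is wrapped in a function so the parsing logic can be checked without a running session.

```python
import json


def parse_payload(raw):
    """Parse a JSON string into an (amount, currency) tuple.

    Returns (None, None) when the input is missing or is not a JSON object,
    so the struct fields below are declared nullable.
    """
    try:
        data = json.loads(raw)
    except (TypeError, ValueError):
        return (None, None)
    if not isinstance(data, dict):
        return (None, None)
    amount = data.get("amount")
    currency = data.get("currency")
    # Coerce defensively: a value of the wrong type becomes null in the struct.
    is_number = isinstance(amount, (int, float)) and not isinstance(amount, bool)
    amount = float(amount) if is_number else None
    currency = currency if isinstance(currency, str) else None
    return (amount, currency)


def add_parsed_columns(df, json_col="payload"):
    """Return df with `amount` and `currency` columns derived from json_col."""
    # Imported here so parse_payload stays usable without pyspark installed.
    from pyspark.sql.functions import udf
    from pyspark.sql.types import StructType, StructField, DoubleType, StringType

    # Hypothetical schema: it must match the tuple returned by parse_payload.
    schema = StructType([
        StructField("amount", DoubleType(), True),
        StructField("currency", StringType(), True),
    ])
    parse_udf = udf(parse_payload, schema)
    # One UDF call produces the struct; "parsed.*" expands it to top-level columns.
    return (df.withColumn("parsed", parse_udf(json_col))
              .select("*", "parsed.*")
              .drop("parsed"))
```

If the input dataframe already has a column named `amount` or `currency`, the expanded names will collide, so the struct fields would need different names. When the UDF does nothing beyond parsing JSON, the built-in `from_json` function (available in newer Spark versions) may be a simpler alternative to a Python UDF.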

See also: Derive multiple columns from a single column in a Spark DataFrame

Answer 2

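You can use flatMap to get the columns of the desired dataframe in one go:

# test_udf must be declared with a struct return type; your_new_schema names the struct's fields
df = df.withColumn('udf_results', test_udf('amount'))
df4 = df.select('udf_results').rdd.flatMap(lambda x: x).toDF(schema=your_new_schema)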
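
Why this works: `df.select('udf_results').rdd` is an RDD of one-field rows, and each row is iterable, so `flatMap(lambda x: x)` yields the inner struct of every row. The plain-Python analogy below uses tuples in place of Spark rows and `itertools.chain.from_iterable` in place of `flatMap`. It is an illustration only, not Spark code, and the sample values are made up.

```python
from itertools import chain

# Each outer tuple stands in for a Row with a single struct column;
# the inner tuple stands in for the struct returned by the UDF.
rows = [((14.0, 0.0),), ((1.5, 1.0),)]

# flatMap(f) applies f to each element and concatenates the results.
# With the identity function, every one-field row contributes its struct.
flattened = list(chain.from_iterable(map(lambda x: x, rows)))

print(flattened)  # [(14.0, 0.0), (1.5, 1.0)]
```

Each flattened element then becomes one row of the new dataframe, with its fields named by the schema passed to `toDF`.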

Answer 1
84, accepted, 2016-02-10 18:59:36

Answer 2
1, 2019-11-10 12:54:55
