
Making new Column from other column in Apache Spark using UDF

I am trying to make a new column from another column in Apache Spark.

The data (heavily abbreviated) looks like

Date    Day_of_Week
2018-05-26T00:00:00.000+0000    5
2018-05-05T00:00:00.000+0000    6

and should end up looking like

Date    Day_of_Week    Weekday
2018-05-26T00:00:00.000+0000    5    Thursday
2018-05-05T00:00:00.000+0000    6    Friday

I have tried the suggestions from the manual at https://docs.databricks.com/spark/latest/spark-sql/udf-python.html#register-the-function-as-a-udf, as well as from How to Pass Constant Values to Python UDF? and PySpark add a column to a DataFrame from a TimeStampType column.

This resulted in:

def int2day (day_int):
  if day_int == 1:
    return 'Sunday'
  elif day_int == 2:
    return 'Monday'
  elif day_int == 3:
    return 'Tuesday'
  elif day_int == 4:
    return 'Wednesday'
  elif day_int == 5:
    return 'Thursday'
  elif day_int == 6:
    return 'Friday'
  elif day_int == 7:
    return 'Saturday'
  else:
    return 'FAIL'

spark.udf.register("day", int2day, IntegerType())
df2 = df.withColumn("Day", day("Day_of_Week"))

which gives a long error:

SparkException: Job aborted due to stage failure: Task 0 in stage 7.0 failed 1 times, most recent failure: Lost task 0.0 in stage 7.0 (TID 8, localhost, executor driver): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/databricks/spark/python/pyspark/worker.py", line 262, in main
    process()
  File "/databricks/spark/python/pyspark/worker.py", line 257, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/databricks/spark/python/pyspark/serializers.py", line 325, in dump_stream
    self.serializer.dump_stream(self._batched(iterator), stream)
  File "/databricks/spark/python/pyspark/serializers.py", line 141, in dump_stream
    self._write_with_length(obj, stream)
  File "/databricks/spark/python/pyspark/serializers.py", line 151, in _write_with_length
    serialized = self.dumps(obj)
  File "/databricks/spark/python/pyspark/serializers.py", line 556, in dumps
    return pickle.dumps(obj, protocol)
PicklingError: Can't pickle <type 'function'>: attribute lookup __builtin__.function failed

I can't see how to apply How to Pass Constant Values to Python UDF? here, since their example was much simpler (just true or false).

I've also tried using map functions, as in PySpark add a column to a DataFrame from a TimeStampType column.

df3 = df2.withColumn("weekday", map(lambda x: int2day, col("Date"))) just says TypeError: argument 2 to map() must support iteration, but I thought col did support iteration.

I've read every example I could find online. I can't see how to apply the other questions to my case.

How can I add another column, using a function of another column?

You don't need to use a UDF at all to do what you're trying to do here. You can leverage the built-in pyspark date_format function to extract the name of the day of the week for a given date in a column.

import pyspark.sql.functions as func
df = df.withColumn("day_of_week", func.date_format(func.col("Date"), "EEEE"))

The result is a new column added to your dataframe, called day_of_week, that displays Sunday, Monday, Tuesday, etc., based on the value in the Date column.
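For reference, here is a minimal end-to-end sketch of this approach. It assumes a local SparkSession and uses plain date strings in place of the question's timestamps (a string like "2018-05-26T00:00:00.000+0000" would first need to be parseable as a timestamp); note that both sample dates actually fall on a Saturday.

from pyspark.sql import SparkSession
import pyspark.sql.functions as func

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [("2018-05-26", 5), ("2018-05-05", 6)],
    ["Date", "Day_of_Week"],
)

# "EEEE" is the datetime pattern for the full weekday name ("Saturday");
# "E" would give the abbreviated form ("Sat").
df = df.withColumn("day_of_week", func.date_format(func.col("Date"), "EEEE"))
df.show()

And if the UDF route from the question is ever needed: the declared return type must match what the Python function actually returns. int2day returns strings, so it should be registered with StringType rather than IntegerType, and pyspark.sql.functions.udf is the usual way to get a callable for use with the DataFrame API. A hedged sketch:

from pyspark.sql.functions import col, udf
from pyspark.sql.types import StringType

# int2day is the mapping function defined in the question
day_udf = udf(int2day, StringType())
df2 = df.withColumn("Day", day_udf(col("Day_of_Week")))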
