PySpark: passing a list as a parameter to a UDF + adding DataFrame columns iteratively

Question

I borrowed this example from a link!

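I would like to understand why DataFrame a, after a column "category" is seemingly added to it, cannot be referenced in subsequent operations. Is DataFrame a somehow immutable? Is there another way to operate on a so that subsequent operations can access the "category" column? Thanks for your help; I'm still learning. I know all the columns could be added in one go to avoid the error, but that is not what I want to do here.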

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

# sample data
a= sqlContext.createDataFrame([("A", 20), ("B", 30), ("D", 80),("E",0)],["Letter", "distances"])
label_list = ["Great", "Good", "OK", "Please Move", "Dead"]

#Passing List as Default value to a variable
def cate( feature_list,label=label_list):
    if feature_list == 0:
        return label[4]
    else:  
        return 'I am not sure!'

def cate2( feature_list,label=label_list):
    if feature_list == 0:
        return label[4]
    elif feature_list.category=='I am not sure!':
        return 'Why not?'

udfcate = udf(cate, StringType())
udfcate2 = udf(cate2, StringType())

a.withColumn("category", udfcate("distances"))
a.show()
a.withColumn("category2", udfcate2("category")).show()
a.show()

I get the error:

C:\Users\gowreden\AppData\Local\Continuum\anaconda3\python.exe C:/Users/gowreden/PycharmProjects/DRC/src/tester.py
2018-08-09 09:06:42 WARN  NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
+------+---------+--------------+
|Letter|distances|      category|
+------+---------+--------------+
|     A|       20|I am not sure!|
|     B|       30|I am not sure!|
|     D|       80|I am not sure!|
|     E|        0|          Dead|
+------+---------+--------------+

Traceback (most recent call last):
  File "C:\Programs\spark-2.3.1-bin-hadoop2.7\python\pyspark\sql\utils.py", line 63, in deco
    return f(*a, **kw)
  File "C:\Programs\spark-2.3.1-bin-hadoop2.7\python\lib\py4j-0.10.7-src.zip\py4j\protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o34.withColumn.
: org.apache.spark.sql.AnalysisException: cannot resolve '`category`' given input columns: [Letter, distances];;
'Project [Letter#0, distances#1L, cate('category) AS category2#20]
+- AnalysisBarrier
      +- LogicalRDD [Letter#0, distances#1L], false
....

Answer 1

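I think there are two problems with your code:

First, as @pault said, withColumn does not work in place: it returns a new DataFrame and leaves a unchanged, so you need to assign the result and modify your code accordingly.
Second, your cate2 function is incorrect: you apply it to the column category, yet inside the function you compare feature_list.category with something, as if the argument were a row.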
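
To make the second point concrete, here is a minimal sketch in plain Python (no Spark needed) of what a regular Python UDF receives: one value of the column per row, not a row object. The names cate2_original and cate2_fixed are invented for this illustration, and the choice to return label[4] for 'Dead' is an assumption made so the result matches the category2 output shown in this answer.

```python
label_list = ["Great", "Good", "OK", "Please Move", "Dead"]


def cate2_original(feature_list, label=label_list):
    # Copied from the question: treats its argument as if it were a row.
    if feature_list == 0:
        return label[4]
    elif feature_list.category == 'I am not sure!':
        return 'Why not?'


def cate2_fixed(category, label=label_list):
    # Assumed intent: the argument already IS the value of the category column.
    if category == label[4]:
        return label[4]
    elif category == 'I am not sure!':
        return 'Why not?'
    return None  # a None result becomes null in the Spark column


# Simulate what udfcate2("category") hands to the function: one string per row.
try:
    cate2_original('I am not sure!')
except AttributeError as exc:
    # A str has no .category attribute, so the original version fails here.
    print("original fails:", exc)

print(cate2_fixed('I am not sure!'))
print(cate2_fixed('Dead'))
```

Wrapping cate2_fixed with udf(cate2_fixed, StringType()) would then work on the category column, provided that column actually exists on the DataFrame being used.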

You may want to drop the first function, cate, and do the following instead:

import pyspark.sql.functions as F

a=a.withColumn('category', F.when(a.distances==0, label_list[4]).otherwise('I am not sure!'))
a.show()

Output:

+------+---------+--------------+
|Letter|distances|      category|
+------+---------+--------------+
|     A|       20|I am not sure!|
|     B|       30|I am not sure!|
|     D|       80|I am not sure!|
|     E|        0|          Dead|
+------+---------+--------------+

Then do the following in place of the second function:

a=a.withColumn('category2', F.when(a.distances==0, label_list[4]).otherwise(F.when(a.category=='I am not sure!', 'Why not?')))
a.show()

Output:

+------+---------+--------------+---------+
|Letter|distances|      category|category2|
+------+---------+--------------+---------+
|     A|       20|I am not sure!| Why not?|
|     B|       30|I am not sure!| Why not?|
|     D|       80|I am not sure!| Why not?|
|     E|        0|          Dead|     Dead|
+------+---------+--------------+---------+
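
If you would rather keep adding columns one step at a time, as the question asks, the same reassignment pattern can be wrapped in a loop. The following is only a sketch: add_columns is a hypothetical helper, not part of PySpark, and it relies solely on withColumn returning a new DataFrame.

```python
def add_columns(df, steps):
    """Apply (column_name, column_expression) steps one after another."""
    for name, expr in steps:
        # withColumn returns a new DataFrame, so rebind df on every pass;
        # each later step can then see the columns added by earlier ones.
        df = df.withColumn(name, expr)
    return df


# Hypothetical usage with the question's objects (needs a live Spark session
# and a udfcate2 whose function compares its argument directly instead of
# reading feature_list.category):
#
#   a = add_columns(a, [
#       ("category", udfcate("distances")),
#       ("category2", udfcate2("category")),
#   ])
#   a.show()
```

The important part is the rebinding inside the loop. Calling withColumn without keeping its return value leaves the original DataFrame untouched, which is what produced the AnalysisException in the question.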

Solution 1
1 2018-08-09 15:21:59
