PySpark: Combining DataFrame columns into a named StructType

Question

I'm looking to combine multiple columns of a PySpark DataFrame into a single column of StructType.

Suppose I have a DataFrame like this:

columns = ['id', 'dogs', 'cats']
vals = [(1, 2, 0),(2, 0, 1)]
df = sqlContext.createDataFrame(vals, columns)

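I would like the resulting DataFrame to look something like this (not how it would actually print, but enough to give you the idea if you are not already familiar with StructType):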

id | animals
1  | dogs=2, cats=0
2  | dogs=0, cats=1

Right now, I can accomplish this with a schema like the following:

from pyspark.sql.types import StructType, StructField, IntegerType

StructType(
    [StructField('dogs', IntegerType(), True),
     StructField('cats', IntegerType(), True)]
)

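However, rather than building this at the end of my udf, I would prefer to do it with a single function. I would be surprised if one did not exist.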

Answer 1

If you need a map column: create literal columns with the column names as keys, then use the create_map function to build the desired map column:

from pyspark.sql.functions import create_map, lit
new_df = df.select(
    'id', 
     create_map(lit('dogs'), 'dogs', lit('cats'), 'cats').alias('animals')
     #                key  :  val,        key   :   val
)

new_df.show(2, False)
#+---+----------------------+
#|id |animals               |
#+---+----------------------+
#|1  |[dogs -> 2, cats -> 0]|
#|2  |[dogs -> 0, cats -> 1]|
#+---+----------------------+

new_df.printSchema()
#root
# |-- id: long (nullable = true)
# |-- animals: map (nullable = false)
# |    |-- key: string
# |    |-- value: long (valueContainsNull = true)

If you need a struct column: use the struct function:

from pyspark.sql.functions import struct
new_df = df.select('id', struct('dogs', 'cats').alias('animals'))
new_df.show(2, False)
#+---+-------+
#|id |animals|
#+---+-------+
#|1  |[2, 0] |
#|2  |[0, 1] |
#+---+-------+

new_df.printSchema()
#root
# |-- id: long (nullable = true)
# |-- animals: struct (nullable = false)
# |    |-- dogs: long (nullable = true)
# |    |-- cats: long (nullable = true)
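
If the list of columns to combine is not known ahead of time, both approaches above can be driven from a list of names. The following is only a sketch: the helper names interleave_keys_and_columns and combine_columns are hypothetical (they are not part of PySpark), and it assumes every column to keep or combine is a top-level column of df.

```python
from itertools import chain


def interleave_keys_and_columns(names):
    """Return [n1, n1, n2, n2, ...]: each name once as the map key
    label and once as the name of the column holding the value."""
    return list(chain.from_iterable((n, n) for n in names))


def combine_columns(df, names, alias, as_map=False):
    """Hypothetical helper: fold the columns in `names` into one
    struct (default) or map column called `alias`."""
    # Imported here so the pure-Python helper above works without Spark.
    from pyspark.sql.functions import col, create_map, lit, struct

    # Columns that are not being combined are passed through unchanged.
    keep = [c for c in df.columns if c not in names]

    if as_map:
        pairs = interleave_keys_and_columns(names)
        # Even positions are literal keys, odd positions are value columns.
        args = [lit(x) if i % 2 == 0 else col(x) for i, x in enumerate(pairs)]
        combined = create_map(*args)
    else:
        combined = struct(*[col(n) for n in names])

    return df.select(*keep, combined.alias(alias))
```

With the DataFrame from the question, combine_columns(df, ['dogs', 'cats'], 'animals') should correspond to the struct example above, and passing as_map=True to the create_map example. Note that a map column requires all values to share one type, whereas a struct can mix types.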
