Mapping a Spark DataFrame column whose name contains special characters

Question

After running df.printSchema() I get the following schema:

root
 |-- key:col1: string (nullable = true)
 |-- key:col2: string (nullable = true)
 |-- col3: string (nullable = true)
 |-- col4: string (nullable = true)
 |-- col5: string (nullable = true)

I need to access key:col2 by its column name, but the following line raises an error because of the : in the name:

df.map(lambda row:row.key:col2)

I also tried:

df.map(lambda row:row["key:col2"])

I can easily get the values of col3, col4 and col5 this way:

df.map(lambda row:row.col4).take(10)

Answer 1

I think you can probably use getattr:

df.map(lambda row: getattr(row, 'key:col2'))

I'm no pyspark expert, so I don't know whether this is the best approach :-).

You might also be able to use operator.attrgetter:

from operator import attrgetter
df.map(attrgetter('key:col2'))

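IIRC, it performs slightly better than a lambda in some cases. Here the difference may be more noticeable than usual, because it avoids the global getattr name lookup, and I think it also reads a bit better in this case.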
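If you want to check the idea without a Spark session, here is a minimal sketch in plain Python. FakeRow is a hypothetical stand-in I made up for illustration, not pyspark's Row; it only assumes that the row resolves attribute access by field name. The point it shows is that getattr and attrgetter take the name as a string, so it does not have to be a valid Python identifier, whereas row.key:col2 cannot even be parsed.

```python
from operator import attrgetter


class FakeRow:
    """Hypothetical stand-in for a row object (NOT pyspark's Row)."""

    def __init__(self, **fields):
        # Field names are plain dict keys, so "key:col2" is allowed here.
        self._fields = fields

    def __getattr__(self, name):
        # Called only when normal attribute lookup fails.
        try:
            return self.__dict__["_fields"][name]
        except KeyError:
            raise AttributeError(name)


rows = [
    FakeRow(**{"key:col2": "a", "col4": "x"}),
    FakeRow(**{"key:col2": "b", "col4": "y"}),
]

# The name is passed as a string, so the colon is not a syntax problem.
via_getattr = list(map(lambda row: getattr(row, "key:col2"), rows))

# attrgetter only splits its argument on dots, so the colon is left alone.
via_attrgetter = list(map(attrgetter("key:col2"), rows))

# Ordinary names still work with plain attribute access.
col4_values = [row.col4 for row in rows]
```

With a real DataFrame you would pass the same lambda or attrgetter to map. Note that on Spark 2.0 and later the Python DataFrame no longer has map, so that would be df.rdd.map(...) instead.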

Answer 1 (accepted), 2016-04-20 06:13:16
