pyspark - fold and sum with an ArrayType column

Question

I'm trying to compute an element-wise sum, and I created this dummy df. The output should be [10,4,4,1].

from pyspark.sql.types import StructType,StructField, StringType, IntegerType, ArrayType
data = [
    ("James",[1,1,1,1]),
    ("James",[2,1,1,0]),
    ("James",[3,1,1,0]),
    ("James",[4,1,1,0])
  ]

schema = StructType([
    StructField("firstname", StringType(), True),
    StructField("scores", ArrayType(IntegerType()), True)
])

df = spark.createDataFrame(data=data,schema=schema)

posexplode works, but my real df is too large, so I'm trying to use fold instead, and it gives me an error. Any ideas? Thanks!

vec_df = df.select("scores")
vec_sums = vec_df.rdd.fold([0]*4, lambda a,b: [x + y for x, y in zip(a, b)])

  File "<ipython-input-115-9b470dedcfef>", line 2, in <listcomp>

TypeError: unsupported operand type(s) for +: 'int' and 'list'

Answer 1

You need to map the RDD of Rows to an RDD of lists before the fold:

vec_sums = vec_df.rdd.map(lambda x: x[0]).fold([0]*4, lambda a,b: [x + y for x, y in zip(a, b)])
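
As a quick local check of the arithmetic, the same fold can be sketched without Spark using functools.reduce. This is only a stand-in: `local_fold`, `add_elementwise`, and the two-partition split are hypothetical names and assumptions for illustration, not part of the PySpark API. It mirrors how RDD.fold starts from the zero value in each partition and again when merging the partition results, which is why the zero value should be neutral for the operation, as [0]*4 is for element-wise addition.

```python
from functools import reduce


def add_elementwise(a, b):
    # Same operation as the lambda passed to fold in the answer.
    return [x + y for x, y in zip(a, b)]


def local_fold(partitions, zero, op):
    # Hypothetical helper: fold each partition from the zero value,
    # then fold the per-partition results, again from the zero value.
    partials = [reduce(op, part, zero) for part in partitions]
    return reduce(op, partials, zero)


# Assumed split of the four example lists into two partitions.
partitions = [
    [[1, 1, 1, 1], [2, 1, 1, 0]],
    [[3, 1, 1, 0], [4, 1, 1, 0]],
]

vec_sums = local_fold(partitions, [0] * 4, add_elementwise)
print(vec_sums)  # [10, 4, 4, 1]
```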

To help understand this, you can look at what the RDDs contain.

>>> vec_df.rdd.collect()
[Row(scores=[1, 1, 1, 1]), Row(scores=[2, 1, 1, 0]), Row(scores=[3, 1, 1, 0]), Row(scores=[4, 1, 1, 0])]

>>> vec_df.rdd.map(lambda x: x[0]).collect()
[[1, 1, 1, 1], [2, 1, 1, 0], [3, 1, 1, 0], [4, 1, 1, 0]]

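So you can think of vec_df.rdd as a nested list: each element is a Row wrapping the scores list. When fold zips the zero value with a Row, zip iterates over the Row's fields, so an int is added to the whole list, which raises the TypeError. The lists need to be un-nested before the fold.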
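
The pairing that triggers the error can be reproduced in plain Python. The sketch below uses a namedtuple called `FakeRow` as a hypothetical stand-in for pyspark.sql.Row (which is also a tuple subclass), so it runs without a Spark session; it is an illustration of the failure, not the code Spark itself executes.

```python
from collections import namedtuple

# Hypothetical stand-in for pyspark.sql.Row, which is also a tuple subclass.
FakeRow = namedtuple("FakeRow", ["scores"])

zero = [0, 0, 0, 0]
row = FakeRow(scores=[1, 1, 1, 1])

# zip iterates over the row's fields, so the first 0 is paired with the whole list.
pairs = list(zip(zero, row))
print(pairs)  # [(0, [1, 1, 1, 1])]

try:
    [x + y for x, y in zip(zero, row)]
    failed = False
except TypeError as exc:
    # Adding an int to a list is not defined, hence the TypeError.
    failed = True
    print(exc)

# Taking the first field, as map(lambda x: x[0]) does, pairs element with element.
unwrapped = [x + y for x, y in zip(zero, row[0])]
print(unwrapped)  # [1, 1, 1, 1]
```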
