pyspark - fold and sum with ArrayType column
I'm trying to do an element-wise sum, and I've created this dummy df. The output should be
[10,4,4,1]
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, ArrayType

data = [
    ("James", [1, 1, 1, 1]),
    ("James", [2, 1, 1, 0]),
    ("James", [3, 1, 1, 0]),
    ("James", [4, 1, 1, 0])
]
schema = StructType([
    StructField("firstname", StringType(), True),
    StructField("scores", ArrayType(IntegerType()), True)
])
df = spark.createDataFrame(data=data, schema=schema)
posexplode works, but my real df is too large, so I'm trying to use fold, but it gives me an error. Any ideas? Thanks!
vec_df = df.select("scores")
vec_sums = vec_df.rdd.fold([0]*4, lambda a,b: [x + y for x, y in zip(a, b)])
File "<ipython-input-115-9b470dedcfef>", line 2, in <listcomp>
<listcomp> 中的文件“<ipython-input-115-9b470dedcfef>”,第 2 行
TypeError: unsupported operand type(s) for +: 'int' and 'list'
类型错误:+ 不支持的操作数类型:“int”和“list”
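For context, the posexplode version the question says works is not shown; a plausible sketch (assuming a plain position-wise sum over all rows, with illustrative names exploded/per_pos) would be:

from pyspark.sql import functions as F

# Hypothetical reconstruction of the posexplode route: explode each array
# together with its position, then sum the scores per position.
exploded = df.select(F.posexplode("scores").alias("pos", "score"))
per_pos = exploded.groupBy("pos").agg(F.sum("score").alias("total")).orderBy("pos")
per_pos.show()
# positions 0..3 -> totals 10, 4, 4, 1, i.e. the expected [10,4,4,1]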
You need to map the RDD of rows to an RDD of lists before fold:
vec_sums = vec_df.rdd.map(lambda x: x[0]).fold([0]*4, lambda a,b: [x + y for x, y in zip(a, b)])
To help understanding, you can see what the RDDs look like.
>>> vec_df.rdd.collect()
[Row(scores=[1, 1, 1, 1]), Row(scores=[2, 1, 1, 0]), Row(scores=[3, 1, 1, 0]), Row(scores=[4, 1, 1, 0])]
>>> vec_df.rdd.map(lambda x: x[0]).collect()
[[1, 1, 1, 1], [2, 1, 1, 0], [3, 1, 1, 0], [4, 1, 1, 0]]
So you can see that vec_df.rdd contains a nested list (each element is a Row wrapping a list), which needs to be unnested before fold.
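As a quick sanity check, the fixed fold returns the expected result. If you'd rather not hard-code the array length in the zero value, RDD.reduce (a minor variation, not part of the original answer) gives the same sum for this non-empty RDD:

>>> vec_sums = vec_df.rdd.map(lambda x: x[0]).fold([0]*4, lambda a, b: [x + y for x, y in zip(a, b)])
>>> vec_sums
[10, 4, 4, 1]
>>> # reduce needs no zero value, so the length 4 is not hard-coded
>>> vec_df.rdd.map(lambda x: x[0]).reduce(lambda a, b: [x + y for x, y in zip(a, b)])
[10, 4, 4, 1]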