Use more than one collect_list in one query in Spark SQL

I have the following DataFrame, data:

root
 |-- userId: string 
 |-- product: string 
 |-- rating: double

and the following query:

val result = sqlContext.sql("select userId, collect_list(product), collect_list(rating) from data group by userId")

My question is: do the product and rating values in the aggregated arrays match each other? That is, do the product and the rating from the same row have the same index in their respective arrays?

Update: Starting from Spark 2.0.0, collect_list can be applied to a struct type, so a single collect_list over a combined column is enough. Before 2.0.0, collect_list can only be used on primitive types.

I believe there is no explicit guarantee that all arrays will have the same order. Spark SQL applies multiple optimizations, and under certain conditions there is no guarantee that all aggregations are scheduled at the same time (one example is aggregation with DISTINCT). Since an exchange (shuffle) results in nondeterministic order, it is theoretically possible that the order will differ.

So while it should work in practice, it could be risky and introduce some hard-to-detect bugs.

If you use Spark 2.0.0 or later, you can aggregate non-atomic columns with collect_list:

SELECT userId, collect_list(struct(product, rating)) FROM data GROUP BY userId
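The same single-aggregation approach can be sketched with the DataFrame API. This is a minimal example, assuming Spark 2.0.0 or later and a local SparkSession; the sample rows and the names pairs, products and ratings are made up for illustration. Selecting a field from an array of structs yields an array of that field, so both output arrays are derived from one collected list and stay index-aligned.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, collect_list, struct}

// Assumption: a local session, used only to make the sketch runnable.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("collect-list-pairs")
  .getOrCreate()
import spark.implicits._

// Hypothetical sample rows with the schema from the question.
val data = Seq(
  ("u1", "apple", 4.0),
  ("u1", "pear", 2.5),
  ("u2", "apple", 5.0)
).toDF("userId", "product", "rating")

// One collect_list over a struct keeps each product with its rating.
val pairs = data
  .groupBy("userId")
  .agg(collect_list(struct(col("product"), col("rating"))).as("pairs"))

// Extracting a field from array<struct> gives an array of that field,
// so products(i) and ratings(i) come from the same input row.
val result = pairs.select(
  col("userId"),
  col("pairs.product").as("products"),
  col("pairs.rating").as("ratings")
)

result.show(false)
```

The order of elements inside each list is still not specified, but the two arrays can no longer drift apart because they are read from the same list of structs.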

If you use an earlier version, you can try to use explicit partitioning and ordering:

WITH tmp AS (
  SELECT * FROM data DISTRIBUTE BY userId SORT BY userId, product, rating
)
SELECT userId, collect_list(product), collect_list(rating)
FROM tmp
GROUP BY userId
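For reference, the query above corresponds roughly to the following DataFrame calls. This is a sketch, not a guarantee of matching order: it only mirrors DISTRIBUTE BY and SORT BY with repartition and sortWithinPartitions (both available since Spark 1.6). The sample rows are invented, and the SparkSession entry point is used only to keep the example short; on a pre-2.0 cluster the DataFrame would come from a HiveContext instead.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, collect_list}

// Assumption: a local session, used only to make the sketch runnable.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("collect-list-sorted")
  .getOrCreate()
import spark.implicits._

// Hypothetical sample rows with the schema from the question.
val data = Seq(
  ("u1", "pear", 2.5),
  ("u2", "apple", 5.0),
  ("u1", "apple", 4.0)
).toDF("userId", "product", "rating")

// DISTRIBUTE BY userId  -> repartition by the userId column
// SORT BY userId, ...   -> sort rows inside each partition only
val tmp = data
  .repartition(col("userId"))
  .sortWithinPartitions("userId", "product", "rating")

// Two separate collect_list aggregations over the pre-sorted input.
val result = tmp
  .groupBy("userId")
  .agg(
    collect_list(col("product")).as("products"),
    collect_list(col("rating")).as("ratings")
  )

result.show(false)
```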
