Does collect_list() maintain the relative order of rows?

Question

Imagine that I have the following DataFrame df:

+---+-----------+------------+
| id|featureName|featureValue|
+---+-----------+------------+
|id1|          a|           3|
|id1|          b|           4|
|id2|          a|           2|
|id2|          c|           5|
|id3|          d|           9|
+---+-----------+------------+

Imagine that I run:

df.groupBy("id")
  .agg(collect_list($"featureName").as("idx"),
       collect_list($"featureValue").as("val"))

Am I GUARANTEED that "idx" and "val" will be aggregated in the same relative order? That is:

GOOD                   GOOD                   BAD
+---+------+------+    +---+------+------+    +---+------+------+
| id|   idx|   val|    | id|   idx|   val|    | id|   idx|   val|
+---+------+------+    +---+------+------+    +---+------+------+
|id3|   [d]|   [9]|    |id3|   [d]|   [9]|    |id3|   [d]|   [9]|
|id1|[a, b]|[3, 4]|    |id1|[b, a]|[4, 3]|    |id1|[a, b]|[4, 3]|
|id2|[a, c]|[2, 5]|    |id2|[c, a]|[5, 2]|    |id2|[a, c]|[5, 2]|
+---+------+------+    +---+------+------+    +---+------+------+

NOTE: The third result is BAD because, for id1, [a, b] should have been associated with [3, 4] (and not [4, 3]). The same holds for id2.

Answer 1

I think you can rely on their relative order: Spark goes over the rows one by one, in order, and usually does not reorder rows unless it explicitly has to.

If you are concerned about the order, merge the two columns with the struct function before doing the groupBy, so that each feature name travels together with its value.

struct(colName: String, colNames: String*): Column
Creates a new struct column that composes multiple input columns.
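
The struct approach could look roughly like the sketch below. It assumes the column names from the question's DataFrame (id, featureName, featureValue); the object name StructPairing, the helper pairedLists, and the intermediate column name pairs are made up for illustration. Each (featureName, featureValue) pair is collected as one element, so the two output arrays are projected from the same array and cannot drift apart, whatever order the rows arrive in.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, collect_list, struct}

object StructPairing {
  // Collect (featureName, featureValue) as a single array of structs,
  // then project each struct field back out into its own array.
  // "pairs" is a hypothetical intermediate column name.
  def pairedLists(df: DataFrame): DataFrame =
    df.groupBy("id")
      .agg(collect_list(struct(col("featureName"), col("featureValue"))).as("pairs"))
      .select(
        col("id"),
        col("pairs.featureName").as("idx"),  // array of the featureName fields
        col("pairs.featureValue").as("val")) // array of the featureValue fields

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("struct-pairing")
      .getOrCreate()
    import spark.implicits._

    // Same data as the DataFrame in the question.
    val df = Seq(
      ("id1", "a", 3), ("id1", "b", 4),
      ("id2", "a", 2), ("id2", "c", 5),
      ("id3", "d", 9)
    ).toDF("id", "featureName", "featureValue")

    pairedLists(df).show()
    spark.stop()
  }
}
```

Note that this only ties idx and val to each other; it does not fix the order of the elements within each array.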

You could also use the monotonically_increasing_id function to number the records and pair that number with the other columns (perhaps using struct):

monotonically_increasing_id(): Column
A column expression that generates monotonically increasing 64-bit integers.

The generated ID is guaranteed to be monotonically increasing and unique, but not consecutive.
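
One way to combine the two ideas is sketched below, as an illustration rather than the answer's exact method. It assumes the question's column names; the names OrderedPairing, orderedLists, rowId, and pairs are hypothetical. Every row is numbered first, the number is placed as the first field of the struct, and sort_array then orders each collected array by that field before the two lists are extracted. The IDs reflect the row order of the DataFrame at the point where they are assigned, so they are only as meaningful as that order.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, collect_list, monotonically_increasing_id, sort_array, struct}

object OrderedPairing {
  // Number the rows, collect (rowId, featureName, featureValue) structs,
  // and sort each array; structs compare field by field, so rowId
  // (the first field) decides the order. "rowId" and "pairs" are
  // hypothetical column names.
  def orderedLists(df: DataFrame): DataFrame =
    df.withColumn("rowId", monotonically_increasing_id())
      .groupBy("id")
      .agg(sort_array(collect_list(
        struct(col("rowId"), col("featureName"), col("featureValue")))).as("pairs"))
      .select(
        col("id"),
        col("pairs.featureName").as("idx"),
        col("pairs.featureValue").as("val"))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("ordered-pairing")
      .getOrCreate()
    import spark.implicits._

    // Same data as the DataFrame in the question.
    val df = Seq(
      ("id1", "a", 3), ("id1", "b", 4),
      ("id2", "a", 2), ("id2", "c", 5),
      ("id3", "d", 9)
    ).toDF("id", "featureName", "featureValue")

    orderedLists(df).show()
    spark.stop()
  }
}
```

If the data already carries a natural ordering column (a timestamp, say), that column could take the place of rowId as the first struct field.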

Answer 1 (11, accepted, 2017-06-09 04:12:00)
