Does collect_list() maintain relative ordering of rows?
Imagine that I have the following DataFrame df:
+---+-----------+------------+
| id|featureName|featureValue|
+---+-----------+------------+
|id1|          a|           3|
|id1|          b|           4|
|id2|          a|           2|
|id2|          c|           5|
|id3|          d|           9|
+---+-----------+------------+
Imagine that I run:
df.groupBy("id")
  .agg(collect_list($"featureName").as("idx"),
       collect_list($"featureValue").as("val"))
Am I GUARANTEED that "idx" and "val" will be aggregated and keep their relative order? I.e.:
       GOOD                GOOD                 BAD
+---+------+------+ +---+------+------+ +---+------+------+
| id|   idx|   val| | id|   idx|   val| | id|   idx|   val|
+---+------+------+ +---+------+------+ +---+------+------+
|id3|   [d]|   [9]| |id3|   [d]|   [9]| |id3|   [d]|   [9]|
|id1|[a, b]|[3, 4]| |id1|[b, a]|[4, 3]| |id1|[a, b]|[4, 3]|
|id2|[a, c]|[2, 5]| |id2|[c, a]|[5, 2]| |id2|[a, c]|[5, 2]|
+---+------+------+ +---+------+------+ +---+------+------+
NOTE: The third result is BAD because, for id1, [a, b] should have been associated with [3, 4] (and not [4, 3]). The same applies to id2.
I think you can rely on "their relative order", as Spark goes over the rows one by one, in order, and usually does not re-order rows unless explicitly needed.
If you are concerned about the order, merge these two columns using the struct function before doing the groupBy.
struct(colName: String, colNames: String*): Column
Creates a new struct column that composes multiple input columns.
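The struct suggestion can be sketched roughly as follows. This is a minimal example rather than a definitive implementation: it assumes a local SparkSession, rebuilds the question's DataFrame by hand, and uses an illustrative intermediate column named pairs. Because each name/value pair is collected as a single struct element, the two output arrays cannot drift apart, although the order of elements within each array is still not fixed.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{collect_list, struct}

// Assumption: a local session, used only to make the sketch runnable.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("collect-list-struct")
  .getOrCreate()
import spark.implicits._

// The DataFrame from the question.
val df = Seq(
  ("id1", "a", 3),
  ("id1", "b", 4),
  ("id2", "a", 2),
  ("id2", "c", 5),
  ("id3", "d", 9)
).toDF("id", "featureName", "featureValue")

// Pack name and value into one struct so that each pair is collected as one element.
// "pairs" is an illustrative column name, not something from the original post.
val paired = df
  .groupBy("id")
  .agg(collect_list(struct($"featureName", $"featureValue")).as("pairs"))

// Selecting a field of an array of structs yields an array of that field,
// so element i of idx and element i of val come from the same struct.
val result = paired.select(
  $"id",
  $"pairs.featureName".as("idx"),
  $"pairs.featureValue".as("val"))

result.show()
```

Keeping the pairs column as it is, instead of splitting it back into idx and val, is also an option if downstream code can work with an array of structs.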
You could also use the monotonically_increasing_id function to number the records and use that number to pair them with the other columns (perhaps using struct):
monotonically_increasing_id(): Column
A column expression that generates monotonically increasing 64-bit integers. The generated ID is guaranteed to be monotonically increasing and unique, but not consecutive.
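One way the numbering idea might look in practice is sketched below. It assumes a local SparkSession and rebuilds the question's DataFrame by hand. The rowId and pairs column names are illustrative, and sorting the collected structs with sort_array is an addition of this sketch, not something the answer prescribes. The generated IDs reflect the row order at the moment they are assigned, so sorting by them fixes the order inside each array as well as keeping idx and val aligned.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{collect_list, monotonically_increasing_id, sort_array, struct}

// Assumption: a local session, used only to make the sketch runnable.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("collect-list-row-id")
  .getOrCreate()
import spark.implicits._

// The DataFrame from the question.
val df = Seq(
  ("id1", "a", 3),
  ("id1", "b", 4),
  ("id2", "a", 2),
  ("id2", "c", 5),
  ("id3", "d", 9)
).toDF("id", "featureName", "featureValue")

// Number the records; "rowId" is an illustrative column name.
val numbered = df.withColumn("rowId", monotonically_increasing_id())

// Collect (rowId, name, value) structs per id and sort them.
// sort_array compares structs field by field, so rowId (the first field) decides the order.
val ordered = numbered
  .groupBy("id")
  .agg(sort_array(collect_list(struct($"rowId", $"featureName", $"featureValue"))).as("pairs"))
  .select(
    $"id",
    $"pairs.featureName".as("idx"),
    $"pairs.featureValue".as("val"))

ordered.show()
```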