Spark - aggregated column disappears from a DataFrame after join
I wanted to count the number of items for each sale_id and decided to use a count function. The idea was to have item_numbers as the last column, without affecting the original column ordering from salesDf.
But after the join, the sale_id column became the first one in df3. So in order to fix this I tried .select(salesDf.schema.fieldNames.map(col):_*). However, after that the item_numbers column is missing (while the ordering of the other columns is correct).

How do I preserve the correct ordering while keeping the item_numbers column in place?
val df2 = salesDf.groupBy("sale_id").agg(count("item_id").as("item_numbers"))
val df3 = salesDf.join(df2, "sale_id").select(salesDf.schema.fieldNames.map(col):_*)
Your select drops item_numbers because salesDf.schema.fieldNames lists only salesDf's original columns, so selecting exactly those names cannot include the aggregated one. To preserve salesDf's column order in the final result while keeping the aggregated column, you can assemble the column list for select as follows:
val df2 = salesDf.groupBy("sale_id").agg(count("item_id").as("item_numbers"))
val df3 = salesDf.join(df2, "sale_id")
val orderedCols = salesDf.columns :+ "item_numbers"
val resultDF = df3.select(orderedCols.map(col): _*)
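The fix hinges on building the name list from salesDf.columns and appending the new name afterwards. A minimal sketch of that list manipulation, using hypothetical column names in place of the real salesDf.columns (no Spark session required):

```scala
object ColumnOrderSketch {
  def main(args: Array[String]): Unit = {
    // Hypothetical stand-ins for salesDf.columns; in the real job these
    // come from the DataFrame's schema.
    val salesCols = Array("sale_id", "item_id", "buyer")

    // :+ appends to the end, so the original ordering is untouched and
    // item_numbers lands in the last position.
    val orderedCols = salesCols :+ "item_numbers"

    println(orderedCols.mkString(", "))
  }
}
```

Passing orderedCols.map(col): _* to select then reproduces salesDf's ordering with item_numbers appended as the last column.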