
Spark - aggregated column disappears from a DataFrame after join

I wanted to count the number of items for each sale_id and decided to use the count function. The idea was to have item_numbers as the last column, without affecting the original column ordering of salesDf.

But after the join, the sale_id column became the first one in df3. To fix this I tried .select(salesDf.schema.fieldNames.map(col):_*), but after that the item_numbers column is missing (while the ordering of the other columns is correct).

How do I preserve the correct column ordering while keeping the item_numbers column at the same time?

import org.apache.spark.sql.functions.{col, count}

val df2 = salesDf.groupBy("sale_id").agg(count("item_id").as("item_numbers"))
// Re-selecting only salesDf's original columns drops item_numbers entirely
val df3 = salesDf.join(df2, "sale_id").select(salesDf.schema.fieldNames.map(col): _*)
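For context, here is a minimal sketch that reproduces the reordering. The question does not show salesDf's schema, so every column besides sale_id and item_id is a hypothetical placeholder, and a spark-shell-style SparkSession named spark is assumed:

// Hypothetical schema, for illustration only
import spark.implicits._

val salesDf = Seq(
  (1, 100, 9.99),
  (2, 100, 4.50),
  (3, 200, 7.25)
).toDF("item_id", "sale_id", "price")

// An equi-join on a column name moves the join key to the front,
// so df3's columns come out as (sale_id, item_id, price, item_numbers)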

The item_numbers column went missing because salesDf.schema.fieldNames contains only the original columns, so selecting on it discards the aggregate. To preserve salesDf's column order in the final result, append item_numbers to that list and select on the combined list:

import org.apache.spark.sql.functions.{col, count}

val df2 = salesDf.groupBy("sale_id").agg(count("item_id").as("item_numbers"))
val df3 = salesDf.join(df2, "sale_id")

// Original columns first, then the aggregate column at the end
val orderedCols = salesDf.columns :+ "item_numbers"
val resultDF = df3.select(orderedCols.map(col): _*)
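As a side note, a window function can produce the same result without a join at all: withColumn appends the new column last, so salesDf's ordering is preserved automatically. This is an alternative technique, not part of the original answer; a minimal sketch:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.count

// Count items per sale_id in place; no join, original column order kept
val resultDF = salesDf.withColumn(
  "item_numbers",
  count("item_id").over(Window.partitionBy("sale_id"))
)

Note that unlike groupBy plus join, the window approach keeps one row per original sales row, which is exactly what the join-based version produces here as well.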
