Scala Spark dataframe join result not in preferred order

Question

I have a dataframe called stores_df that contains store information such as date and sales. I have another dataframe called avg_sales_store_by_month that contains the average sales for each month of each store. I want to take the average sales column from it and append it to stores_df. The issue I have is that after my join, the row order of stores_df changes.

Below are the first few rows of stores_df.

+-----+----------+---------+----+------------+-----------+----------+-----------+------------+-----+----+---+
|Store|      Date|IsHoliday|Dept|Weekly_Sales|Temperature|Fuel_Price|        CPI|Unemployment|Month|Year|Day|
+-----+----------+---------+----+------------+-----------+----------+-----------+------------+-----+----+---+
|    1|2010-02-05|    FALSE|   1|       24924|      42.31|     2.572|211.0963582|       8.106|    2|2010|  5|
|    1|2010-02-12|     TRUE|   1|       46039|      38.51|     2.548|211.2421698|       8.106|    2|2010| 12|
|    1|2010-02-19|    FALSE|   1|       41595|      39.93|     2.514|211.2891429|       8.106|    2|2010| 19|
|    1|2010-05-14|    FALSE|   1|       18926|      74.78|     2.854|210.3374261|       7.808|    5|2010| 14|
+-----+----------+---------+----+------------+-----------+----------+-----------+------------+-----+----+---+

Below are the first few rows of avg_sales_store_by_month. I want to grab the last column and append it to the end of stores_df.

+-----+-----+------------------+
|Store|Month|avg_sales_by_month|
+-----+-----+------------------+
|   39|   11|          23317.75|
|   43|    7|          13090.84|
|   10|    2|          28407.05|
|   23|    6|           21265.7|
|    4|   10|           28723.2|
|    9|   10|            8468.2|
+-----+-----+------------------+

My issue arises when I use my join:

stores_df = stores_df.join( avg_sales_store_by_month, Seq("Store", "Month"), "left" )

The rows of stores_df get reordered. I would like them to be in the same order as before the join, but with the extra column. How do I achieve this?

After the join snippet, the order is messed up:

+-----+-----+----------+---------+----+------------+-----------+----------+-----------+------------+----+---+------------------+
|Store|Month|      Date|IsHoliday|Dept|Weekly_Sales|Temperature|Fuel_Price|        CPI|Unemployment|Year|Day|avg_sales_by_month|
+-----+-----+----------+---------+----+------------+-----------+----------+-----------+------------+----+---+------------------+
|   39|   11|2010-11-05|    FALSE|   1|       31729|      61.62|     2.689|210.7202444|       8.476|2010|  5|          23317.75|
|   39|   11|2010-11-12|    FALSE|   1|       12324|      62.21|     2.728|210.7667944|       8.476|2010| 12|          23317.75|
|   39|   11|2010-11-19|    FALSE|   1|       15137|       55.5|     2.771|  210.65429|       8.476|2010| 19|          23317.75|
|   39|   11|2011-11-11|    FALSE|   2|       65758|      63.11|     3.297|216.7217373|       7.716|2011| 11|          23317.75|
|   39|   11|2011-11-18|    FALSE|   2|       70050|      66.09|     3.308|216.9395861|       7.716|2011| 18|          23317.75|
+-----+-----+----------+---------+----+------------+-----------+----------+-----------+------------+----+---+------------------+

Answer 1

If you want to preserve the original column order, you can save the first dataframe's columns along with the additional column in an Array and select them after the join, as in the following example:

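// Assumes a SparkSession named `spark` is in scope (as in spark-shell);
// the implicits import provides toDF on local Seqs.
import spark.implicits._

val df1 = Seq(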
  (1, 25000, 3, 2010, 20),
  (1, 30000, 3, 2010, 27),
  (1, 20000, 4, 2010, 3),
  (2, 40000, 3, 2010, 20),
  (2, 35000, 3, 2010, 27),
  (2, 35000, 4, 2010, 3)
).toDF("Store", "Wk_Sales", "Month", "year", "Day")

val df2 = Seq(
  (1, 3, 100000),
  (1, 4, 90000),
  (2, 3, 140000),
  (2, 4, 110000)
).toDF("Store", "Month", "Mo_Sales")

val joinedDF = df1.join(df2, Seq("Store", "Month"), "left")
// +-----+-----+--------+----+---+--------+
// |Store|Month|Wk_Sales|year|Day|Mo_Sales|
// +-----+-----+--------+----+---+--------+
// |    1|    3|   25000|2010| 20|  100000|
// |    1|    3|   30000|2010| 27|  100000|
// |    1|    4|   20000|2010|  3|   90000|
// |    2|    3|   40000|2010| 20|  140000|
// |    2|    3|   35000|2010| 27|  140000|
// |    2|    4|   35000|2010|  3|  110000|
// +-----+-----+--------+----+---+--------+

val cols = df1.columns :+ "Mo_Sales"

joinedDF.select(cols.head, cols.tail: _*).show()
// +-----+--------+-----+----+---+--------+
// |Store|Wk_Sales|Month|year|Day|Mo_Sales|
// +-----+--------+-----+----+---+--------+
// |    1|   25000|    3|2010| 20|  100000|
// |    1|   30000|    3|2010| 27|  100000|
// |    1|   20000|    4|2010|  3|   90000|
// |    2|   40000|    3|2010| 20|  140000|
// |    2|   35000|    3|2010| 27|  140000|
// |    2|   35000|    4|2010|  3|  110000|
// +-----+--------+-----+----+---+--------+
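
The question also asks about row order, which the select above does not restore: a join may shuffle data, so Spark makes no promise about the order of the rows it returns. A minimal sketch of one common workaround, assuming the current layout of df1 is the order you want to keep: tag each row with monotonically_increasing_id before the join, sort on that tag afterwards, and drop it in the final select. The column name row_id and the local SparkSession setup are illustrative assumptions, not part of the original post.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.monotonically_increasing_id

// Local session for illustration only; in spark-shell `spark` already exists.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("join-order-sketch")
  .getOrCreate()
import spark.implicits._

val df1 = Seq(
  (1, 25000, 3, 2010, 20),
  (1, 30000, 3, 2010, 27),
  (1, 20000, 4, 2010, 3),
  (2, 40000, 3, 2010, 20),
  (2, 35000, 3, 2010, 27),
  (2, 35000, 4, 2010, 3)
).toDF("Store", "Wk_Sales", "Month", "year", "Day")

val df2 = Seq(
  (1, 3, 100000),
  (1, 4, 90000),
  (2, 3, 140000),
  (2, 4, 110000)
).toDF("Store", "Month", "Mo_Sales")

// Tag every row with an id that increases with its current position.
// "row_id" is a hypothetical helper column name.
val indexed = df1.withColumn("row_id", monotonically_increasing_id())

// Original column order plus the new column; row_id is left out on purpose.
val cols = df1.columns :+ "Mo_Sales"

val restored = indexed
  .join(df2, Seq("Store", "Month"), "left")
  .orderBy("row_id")                  // put rows back in their pre-join order
  .select(cols.head, cols.tail: _*)   // put columns back and drop row_id

restored.show()
```

The ids are increasing but not consecutive, which is enough for sorting. If the data already has a natural sort key (for example Store, Dept and Date in stores_df), sorting on those columns after the join is a simpler alternative.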

Solution 1
2 Accepted 2018-03-31 00:15:04