How to add a sort condition to a Spark Dataframe
I have a dataframe like this:
+-------------------+----+
|DATE |CODE|
+-------------------+----+
|2015/02/30-14:32:32|xv |
|2016/02/30-12:50:11|val2|
|2016/02/30-12:50:11|val2|
|2016/02/30-12:50:11|val2|
|2015/01/30-10:45:16|val2|
|2016/02/30-07:45:26|cv |
|2016/02/30-12:50:11|val1|
|2016/02/30-12:50:11|val1|
|2016/02/30-12:50:11|val1|
|2016/02/30-12:50:11|val3|
|2016/02/30-12:50:11|val3|
|2015/01/30-10:45:16|val1|
|2015/11/30-04:45:19|sd |
|2015/05/23-10:32:16|val2|
|2016/09/30-14:45:58|cv |
|2015/08/30-15:45:00|rt |
|2016/01/30-10:35:31|cv |
|2016/06/30-20:35:30|xv |
|2015/05/23-10:32:16|val1|
|2016/07/19-22:05:48|rt |
+-------------------+----+
I use this code to sort my sample by date:
val df = sc.parallelize(Seq(
("2015/02/30-14:32:32", "xv"),
("2016/02/30-12:50:11", "val2"),
("2016/02/30-12:50:11", "val2"),
("2016/02/30-12:50:11", "val2"),
("2015/01/30-10:45:16", "val2"),
("2016/02/30-07:45:26", "cv"),
("2016/02/30-12:50:11", "val1"),
("2016/02/30-12:50:11", "val1"),
("2016/02/30-12:50:11", "val1"),
("2016/02/30-12:50:11", "val3"),
("2015/01/30-10:45:16", "val3"),
("2015/11/30-04:45:19", "sd"),
("2015/05/23-10:32:16", "val2"),
("2016/09/30-14:45:58", "cv"),
("2015/08/30-15:45:00", "rt"),
("2016/01/30-10:35:31", "cv"),
("2016/06/30-20:35:30", "xv"),
("2015/05/23-10:32:16", "val1"),
("2016/07/19-22:05:48", "rt")
)).toDF("DATE", "CODE")
val df_sorted = df.sort("DATE")
df_sorted.show(false)
I get this result:
+-------------------+----+
|DATE |CODE|
+-------------------+----+
|2015/01/30-10:45:16|val3|
|2015/01/30-10:45:16|val2|
|2015/02/30-14:32:32|xv |
|2015/05/23-10:32:16|val2|
|2015/05/23-10:32:16|val1|
|2015/08/30-15:45:00|rt |
|2015/11/30-04:45:19|sd |
|2016/01/30-10:35:31|cv |
|2016/02/30-07:45:26|cv |
|2016/02/30-12:50:11|val3|
|2016/02/30-12:50:11|val1|
|2016/02/30-12:50:11|val2|
|2016/02/30-12:50:11|val2|
|2016/02/30-12:50:11|val1|
|2016/02/30-12:50:11|val2|
|2016/02/30-12:50:11|val1|
|2016/06/30-20:35:30|xv |
|2016/07/19-22:05:48|rt |
|2016/09/30-14:45:58|cv |
+-------------------+----+
I would like to add a sort condition. I want all my codes starting with val to appear in this order: val2, val1, val3, whenever they share the same date YYYY/MM/DD-hh:mm:ss, giving the following result:
+-------------------+----+
|DATE |CODE|
+-------------------+----+
|2015/01/30-10:45:16|val2|
|2015/01/30-10:45:16|val1|
|2015/02/30-14:32:32|xv |
|2015/05/23-10:32:16|val2|
|2015/05/23-10:32:16|val1|
|2015/08/30-15:45:00|rt |
|2015/11/30-04:45:19|sd |
|2016/01/30-10:35:31|cv |
|2016/02/30-07:45:26|cv |
|2016/02/30-12:50:11|val2|
|2016/02/30-12:50:11|val2|
|2016/02/30-12:50:11|val2|
|2016/02/30-12:50:11|val1|
|2016/02/30-12:50:11|val1|
|2016/02/30-12:50:11|val1|
|2016/02/30-12:50:11|val3|
|2016/02/30-12:50:11|val3|
|2016/06/30-20:35:30|xv |
|2016/07/19-22:05:48|rt |
|2016/09/30-14:45:58|cv |
+-------------------+----+
Do you have any ideas?
This assumes sc is a HiveContext; if it is not, first wrap your sparkContext in a Hive context.
df.registerTempTable("MY_TEMP_TABLE")
val sortedDF = sc.sql("SELECT * FROM MY_TEMP_TABLE ORDER BY DATE ASC, CODE DESC")
sortedDF.show
Or whatever version of the SQL sort you want to run.
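Note that ORDER BY CODE DESC sorts the codes as val3, val2, val1, not the val2, val1, val3 order the question asks for. One way to express a custom order in SQL is a CASE expression used as a sort key. A sketch, assuming the same temp table registered above:

```scala
// Sketch: rank CODE explicitly so rows with the same DATE come out as
// val2, val1, val3. Codes not listed (xv, cv, rt, sd) fall through to
// rank 99 and are ordered alphabetically among themselves.
val sortedDF = sc.sql("""
  SELECT DATE, CODE
  FROM MY_TEMP_TABLE
  ORDER BY DATE ASC,
           CASE CODE WHEN 'val2' THEN 0
                     WHEN 'val1' THEN 1
                     WHEN 'val3' THEN 2
                     ELSE 99 END ASC,
           CODE ASC
""")
sortedDF.show(false)
```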
You can sort by multiple columns:
val df_sorted2 = df.sort("DATE","CODE")
df_sorted2.show()
Which gives me:
+-------------------+----+
| DATE|CODE|
+-------------------+----+
|2015/01/30-10:45:16|val1|
|2015/01/30-10:45:16|val2|
|2015/02/30-14:32:32| xv|
|2015/05/23-10:32:16|val1|
|2015/05/23-10:32:16|val2|
|2015/08/30-15:45:00| rt|
|2015/11/30-04:45:19| sd|
|2016/01/30-10:35:31| cv|
|2016/02/30-07:45:26| cv|
|2016/02/30-12:50:11|val1|
|2016/02/30-12:50:11|val1|
|2016/02/30-12:50:11|val1|
|2016/02/30-12:50:11|val2|
|2016/02/30-12:50:11|val2|
|2016/02/30-12:50:11|val2|
|2016/06/30-20:35:30| xv|
|2016/07/19-22:05:48| rt|
|2016/09/30-14:45:58| cv|
+-------------------+----+
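A caveat: sorting on the CODE column directly is alphabetical (val1 before val2 before val3), which is not quite the val2, val1, val3 order requested. In the DataFrame API, one way to encode a custom order is to sort on a derived rank built with when/otherwise. A sketch, assuming the df from the question:

```scala
import org.apache.spark.sql.functions.{col, when}

// Map each CODE to an explicit rank; codes outside the val* group get a
// large rank so they sort after it, with CODE itself as a tie-breaker.
val codeRank = when(col("CODE") === "val2", 0)
  .when(col("CODE") === "val1", 1)
  .when(col("CODE") === "val3", 2)
  .otherwise(99)

val df_custom = df.sort(col("DATE").asc, codeRank.asc, col("CODE").asc)
df_custom.show(false)
```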