
How to add a sort condition to a Spark Dataframe

I have a dataframe like this:

+-------------------+----+
|DATE               |CODE|
+-------------------+----+
|2015/02/30-14:32:32|xv  |
|2016/02/30-12:50:11|val2|
|2016/02/30-12:50:11|val2|
|2016/02/30-12:50:11|val2|
|2015/01/30-10:45:16|val2|
|2016/02/30-07:45:26|cv  |
|2016/02/30-12:50:11|val1|
|2016/02/30-12:50:11|val1|
|2016/02/30-12:50:11|val1|
|2016/02/30-12:50:11|val3|
|2016/02/30-12:50:11|val3|
|2015/01/30-10:45:16|val1|
|2015/11/30-04:45:19|sd  |
|2015/05/23-10:32:16|val2|
|2016/09/30-14:45:58|cv  |
|2015/08/30-15:45:00|rt  |
|2016/01/30-10:35:31|cv  |
|2016/06/30-20:35:30|xv  |
|2015/05/23-10:32:16|val1|
|2016/07/19-22:05:48|rt  |
+-------------------+----+

I use this code to sort my sample by date:

val df = sc.parallelize(Seq(
  ("2015/02/30-14:32:32", "xv"),
  ("2016/02/30-12:50:11", "val2"),
  ("2016/02/30-12:50:11", "val2"),
  ("2016/02/30-12:50:11", "val2"),
  ("2015/01/30-10:45:16", "val2"),
  ("2016/02/30-07:45:26", "cv"),
  ("2016/02/30-12:50:11", "val1"),
  ("2016/02/30-12:50:11", "val1"),
  ("2016/02/30-12:50:11", "val1"),
  ("2016/02/30-12:50:11", "val3"),
  ("2015/01/30-10:45:16", "val3"),
  ("2015/11/30-04:45:19", "sd"),
  ("2015/05/23-10:32:16", "val2"),
  ("2016/09/30-14:45:58", "cv"),
  ("2015/08/30-15:45:00", "rt"),
  ("2016/01/30-10:35:31", "cv"),
  ("2016/06/30-20:35:30", "xv"),
  ("2015/05/23-10:32:16", "val1"),
  ("2016/07/19-22:05:48", "rt")
)).toDF("DATE", "CODE")

val df_sorted = df.sort("DATE")

df_sorted.show(false)

I get this result:

+-------------------+----+
|DATE               |CODE|
+-------------------+----+
|2015/01/30-10:45:16|val3|
|2015/01/30-10:45:16|val2|
|2015/02/30-14:32:32|xv  |
|2015/05/23-10:32:16|val2|
|2015/05/23-10:32:16|val1|
|2015/08/30-15:45:00|rt  |
|2015/11/30-04:45:19|sd  |
|2016/01/30-10:35:31|cv  |
|2016/02/30-07:45:26|cv  |
|2016/02/30-12:50:11|val3|
|2016/02/30-12:50:11|val1|
|2016/02/30-12:50:11|val2|
|2016/02/30-12:50:11|val2|
|2016/02/30-12:50:11|val1|
|2016/02/30-12:50:11|val2|
|2016/02/30-12:50:11|val1|
|2016/06/30-20:35:30|xv  |
|2016/07/19-22:05:48|rt  |
|2016/09/30-14:45:58|cv  |
+-------------------+----+

I want to add a sort condition: when codes share the same date (YYYY/MM/DD-hh:mm:ss), all the codes starting with val should appear in the order val2, val1, val3, giving this result:

+-------------------+----+
|DATE               |CODE|
+-------------------+----+
|2015/01/30-10:45:16|val2|
|2015/01/30-10:45:16|val1|
|2015/02/30-14:32:32|xv  |
|2015/05/23-10:32:16|val2|
|2015/05/23-10:32:16|val1|
|2015/08/30-15:45:00|rt  |
|2015/11/30-04:45:19|sd  |
|2016/01/30-10:35:31|cv  |
|2016/02/30-07:45:26|cv  |
|2016/02/30-12:50:11|val2|
|2016/02/30-12:50:11|val2|
|2016/02/30-12:50:11|val2|
|2016/02/30-12:50:11|val1|
|2016/02/30-12:50:11|val1|
|2016/02/30-12:50:11|val1|
|2016/02/30-12:50:11|val3|
|2016/02/30-12:50:11|val3|
|2016/06/30-20:35:30|xv  |
|2016/07/19-22:05:48|rt  |
|2016/09/30-14:45:58|cv  |
+-------------------+----+

Do you have any idea how to do this?
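For what it's worth, a plain ascending or descending sort on CODE cannot produce the order val2, val1, val3; one approach is to map each code to an explicit rank and sort by (DATE, rank). Below is a minimal plain-Scala sketch of that idea on a few of the sample rows (the rank values are my own arbitrary choice; in the DataFrame API the same rank could be built as a column with when/otherwise, e.g. `when($"CODE" === "val2", 0).when($"CODE" === "val1", 1)...`, used in `orderBy` and then dropped):

```scala
// Custom rank for the "val*" codes; any other code falls back to rank 3,
// an arbitrary choice that leaves non-"val" codes after them.
val rank = Map("val2" -> 0, "val1" -> 1, "val3" -> 2).withDefaultValue(3)

// A few rows from the sample above.
val rows = Seq(
  ("2016/02/30-12:50:11", "val3"),
  ("2016/02/30-12:50:11", "val1"),
  ("2016/02/30-12:50:11", "val2"),
  ("2015/01/30-10:45:16", "val2")
)

// Sort by date first (the fixed-width date format sorts chronologically
// as a string), then by the custom rank of the code.
val sorted = rows.sortBy { case (date, code) => (date, rank(code)) }
// Codes now come out as: val2 (the 2015 row), then val2, val1, val3.
```

The same (DATE, rank) key works unchanged as a Spark sort expression, since orderBy accepts arbitrary columns.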

This assumes sc is a HiveContext; if it isn't, wrap the SparkContext in a Hive context first.

df.registerTempTable("MY_TEMP_TABLE")

val sortedDF = sc.sql("SELECT * FROM MY_TEMP_TABLE ORDER BY DATE ASC, CODE DESC")
sortedDF.show

Or whatever flavor of SQL sort you want to run.

You can sort by multiple columns:

val df_sorted2 = df.sort("DATE", "CODE")
df_sorted2.show()

Which gives me:

+-------------------+----+
|               DATE|CODE|
+-------------------+----+
|2015/01/30-10:45:16|val1|
|2015/01/30-10:45:16|val2|
|2015/02/30-14:32:32|  xv|
|2015/05/23-10:32:16|val1|
|2015/05/23-10:32:16|val2|
|2015/08/30-15:45:00|  rt|
|2015/11/30-04:45:19|  sd|
|2016/01/30-10:35:31|  cv|
|2016/02/30-07:45:26|  cv|
|2016/02/30-12:50:11|val1|
|2016/02/30-12:50:11|val1|
|2016/02/30-12:50:11|val1|
|2016/02/30-12:50:11|val2|
|2016/02/30-12:50:11|val2|
|2016/02/30-12:50:11|val2|
|2016/06/30-20:35:30|  xv|
|2016/07/19-22:05:48|  rt|
|2016/09/30-14:45:58|  cv|
+-------------------+----+
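Note that this two-column sort is alphabetical, so rows with the same timestamp come out as val1, val2, val3 rather than the requested val2, val1, val3. The difference is easy to check in plain Scala on a few of the sample rows:

```scala
// Alphabetical sort on (DATE, CODE) tuples, mirroring df.sort("DATE", "CODE").
val rows = Seq(
  ("2016/02/30-12:50:11", "val3"),
  ("2016/02/30-12:50:11", "val1"),
  ("2016/02/30-12:50:11", "val2")
)
val alphabetical = rows.sorted
// For the shared timestamp the codes come out val1, val2, val3,
// which is why this answer does not match the requested custom order.
```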
