Spark: how to reduce by a column whose data type is date
I'm working with a DataFrame that looks like the following:
-------------------------------
| time                | value |
-------------------------------
| 2014-12-01 02:54:00 | 2     |
| 2014-12-01 03:54:00 | 3     |
| 2014-12-01 04:54:00 | 4     |
| 2014-12-01 05:54:00 | 5     |
| 2014-12-02 02:54:00 | 6     |
| 2014-12-02 02:54:00 | 7     |
| 2014-12-03 02:54:00 | 8     |
-------------------------------
The number of samples per day is quite random.
I want to keep only one sample per day, for example:
-------------------------------
| time                | value |
-------------------------------
| 2014-12-01 02:54:00 | 2     |
| 2014-12-02 02:54:00 | 6     |
| 2014-12-03 02:54:00 | 8     |
-------------------------------
I don't care which sample I get from each day, but I want to be sure I get exactly one, so that no day is repeated in the time column.
You can first create a date column and then dropDuplicates based on that date column; here is a pyspark example, but the syntax should be similar if you're using scala or java:
import pyspark.sql.functions as f

# Derive a date-only helper column, keep one row per date, then drop the helper.
df.withColumn('date', f.to_date('time', 'yyyy-MM-dd HH:mm:ss')) \
  .dropDuplicates(['date']).drop('date').show()
+-------------------+-----+
| time|value|
+-------------------+-----+
|2014-12-02 02:54:00| 6|
|2014-12-03 02:54:00| 8|
|2014-12-01 02:54:00| 2|
+-------------------+-----+
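Note that dropDuplicates keeps an arbitrary row for each date. If you want the choice to be deterministic, say the earliest sample of each day, here is a minimal pyspark sketch using a window function (the window w and the helper column rn are just illustrative names, and df is assumed to be the same DataFrame as above):
import pyspark.sql.functions as f
from pyspark.sql import Window

# Rank the rows within each calendar day by time, then keep the first row per day.
w = Window.partitionBy(f.to_date('time', 'yyyy-MM-dd HH:mm:ss')).orderBy('time')
df.withColumn('rn', f.row_number().over(w)) \
  .filter(f.col('rn') == 1).drop('rn').show()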
You can use a window function: generate a row_number by partitioning on the date value, then filter on row_number = 1.
Check this out:
val df = Seq(("2014-12-01 02:54:00","2"), ("2014-12-01 03:54:00","3"),
    ("2014-12-01 04:54:00","4"), ("2014-12-01 05:54:00","5"),
    ("2014-12-02 02:54:00","6"), ("2014-12-02 02:54:00","7"),
    ("2014-12-03 02:54:00","8")).toDF("time","value")
  // The casts must be chained: withColumn returns a new DataFrame and does not mutate df.
  .withColumn("time", 'time.cast("timestamp"))
  .withColumn("value", 'value.cast("int"))
df.createOrReplaceTempView("timetab")
spark.sql(
  """ with order_ts as (select time, value,
        row_number() over (partition by date_format(time,"yyyyMMdd") order by value) as rn
      from timetab)
      select time, value from order_ts where rn = 1
  """).show(false)
Output:
+-------------------+-----+
|time |value|
+-------------------+-----+
|2014-12-02 02:54:00|6 |
|2014-12-01 02:54:00|2 |
|2014-12-03 02:54:00|8 |
+-------------------+-----+
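Here the order by value inside the window keeps, for each day, the row with the smallest value; ordering by time instead would keep each day's earliest sample.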