pyspark forward fill Time Stamp columns with specific value (1 second)

In the past I asked this question regarding the Python pandas library: pandas forward fill Time Stamp columns with specific value (1 second)

But now I will be doing huge data processing in pyspark, so I am asking for another solution in pyspark:

I have a Spark DataFrame:

from pyspark.sql import Row

df = spark.createDataFrame([Row(a=1, b='2018-09-26 04:38:32.544', c='11', d='foo'),
                            Row(a=2, b='', c='22', d='bar'),
                            Row(a=3, b='', c='33', d='foo'),
                            Row(a=4, b='', c='44', d='bar'),
                            Row(a=5, b='2018-09-26 04:58:32.544', c='55', d='foo'),
                            Row(a=6, b='', c='66', d='bar')])
df.show(truncate=False)

+---+-----------------------+---+---+
|a  |b                      |c  |d  |
+---+-----------------------+---+---+
|1  |2018-09-26 04:38:32.544|11 |foo|
|2  |                       |22 |bar|
|3  |                       |33 |foo|
|4  |                       |44 |bar|
|5  |2018-09-26 04:58:32.544|55 |foo|
|6  |                       |66 |bar|
+---+-----------------------+---+---+

And I would like to consecutively add 1 second to each missing timestamp, counting from the previous available one:

+---+-----------------------+---+---+
|a  |b                      |c  |d  |
+---+-----------------------+---+---+
|1  |2018-09-26 04:38:32.544|11 |foo|
|2  |2018-09-26 04:39:32.544|22 |bar|
|3  |2018-09-26 04:40:32.544|33 |foo|
|4  |2018-09-26 04:41:32.544|44 |bar|
|5  |2018-09-26 04:58:32.544|55 |foo|
|6  |2018-09-26 04:59:32.544|66 |bar|
+---+-----------------------+---+---+

I've read that UDFs should be avoided, as they will slow down processing on millions of rows. Thanks for the help!

UPDATE 2019/09/09

After talking with @cronoik below, here is a case study where one column, d, is used for partitioning the dataset:

df2 = spark.createDataFrame([Row(a=1, b='2018-09-26 04:38:32.544', c='11', d='foo'),
                             Row(a=2, b='', c='22', d='foo'),
                             Row(a=3, b='', c='33', d='foo'),
                             Row(a=4, b='', c='44', d='foo'),
                             Row(a=5, b='2018-09-26 04:58:32.544', c='55', d='foo'),
                             Row(a=6, b='', c='66', d='foo'),
                             Row(a=1, b='2018-09-28 05:40:32.544', c='111', d='bar'),
                             Row(a=2, b='', c='222', d='bar'),
                             Row(a=3, b='2018-09-28 05:49:32.544', c='333', d='bar'),
                             Row(a=4, b='', c='444', d='bar'),
                             Row(a=5, b='2018-09-28 05:55:32.544', c='555', d='bar'),
                             Row(a=6, b='', c='666', d='bar')])

+---+-----------------------+---+---+
|a  |b                      |c  |d  |
+---+-----------------------+---+---+
|1  |2018-09-26 04:38:32.544|11 |foo|
|2  |                       |22 |foo|
|3  |                       |33 |foo|
|4  |                       |44 |foo|
|5  |2018-09-26 04:58:32.544|55 |foo|
|6  |                       |66 |foo|
|1  |2018-09-28 05:40:32.544|111|bar|
|2  |                       |222|bar|
|3  |2018-09-28 05:49:32.544|333|bar|
|4  |                       |444|bar|
|5  |2018-09-28 05:55:32.544|555|bar|
|6  |                       |666|bar|
+---+-----------------------+---+---+

This is probably not the most efficient solution, as we can't partition the dataframe according to your requirements. That means all the data is loaded into a single partition and ordered there. Maybe someone can come up with a better solution.

The code below uses a lag window function, which returns the value of the previous row. We apply it only when the current value of b is null; otherwise we keep the current value. When the current value is null, we add one second to the value of the previous row. We have to do this several times, because a row whose b is null and whose previous row is also null in b will get null back from lag (i.e. lag is not applied consecutively, so we have to do that ourselves).

import pyspark.sql.functions as F
from pyspark.sql import Row
from pyspark.sql import Window

df = spark.createDataFrame([Row(a=1, b='2018-09-26 04:38:32.544', c='11', d='foo'),
                            Row(a=2, b='', c='22', d='bar'),
                            Row(a=3, b='', c='33', d='foo'),
                            Row(a=4, b='', c='44', d='bar'),
                            Row(a=5, b='2018-09-26 04:58:32.544', c='55', d='foo'),
                            Row(a=6, b='', c='66', d='bar')])

df = df.withColumn('a', df.a.cast("int"))
df = df.withColumn('b', df.b.cast("timestamp"))  # empty strings become null

# single window over the whole dataframe, ordered by a
w = Window.orderBy('a')

# repeat until no nulls are left, since lag only looks one row back per pass
while df.filter(df.b.isNull()).count() != 0:
    df = df.withColumn('b', F.when(df.b.isNotNull(), df.b)
                             .otherwise(F.lag('b').over(w) + F.expr('INTERVAL 1 SECONDS')))

df.show(truncate=False)

Output:

+---+-----------------------+---+---+ 
| a |                     b | c | d | 
+---+-----------------------+---+---+ 
| 1 |2018-09-26 04:38:32.544|11 |foo| 
| 2 |2018-09-26 04:38:33.544|22 |bar| 
| 3 |2018-09-26 04:38:34.544|33 |foo| 
| 4 |2018-09-26 04:38:35.544|44 |bar| 
| 5 |2018-09-26 04:58:32.544|55 |foo| 
| 6 |2018-09-26 04:58:33.544|66 |bar| 
+---+-----------------------+---+---+

UPDATE 2019/09/09

In your edit you said that the column d can be used as a partition key. There is not much you have to change for partitioning: just replace w = Window.orderBy('a') with w = Window.partitionBy('d').orderBy('a'), and Spark will generate a partition for each distinct value of d and execute the code in parallel for each partition.
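
For reference, here is a minimal sketch of that partitioned variant, assuming the df2 DataFrame defined in the update above and an active SparkSession; compared with the code before, only the window definition changes:

import pyspark.sql.functions as F
from pyspark.sql import Window

# df2 is assumed to be the DataFrame created in the 2019/09/09 update above
df2 = df2.withColumn('a', df2.a.cast("int"))
df2 = df2.withColumn('b', df2.b.cast("timestamp"))  # empty strings become null

# partition by d: the fill is applied independently (and in parallel) per group
w = Window.partitionBy('d').orderBy('a')

while df2.filter(df2.b.isNull()).count() != 0:
    df2 = df2.withColumn('b', F.when(df2.b.isNotNull(), df2.b)
                               .otherwise(F.lag('b').over(w) + F.expr('INTERVAL 1 SECONDS')))

df2.show(truncate=False)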
