Better way to add column values combinations to data frame in PySpark
I have a dataset with 3 columns: id, day, and value. I need to add rows with value set to zero for all combinations of id and day.
# Simplified version of my data frame
data = [("1", "2020-04-01", 5),
("2", "2020-04-01", 5),
("3", "2020-04-02", 4)]
df = spark.createDataFrame(data,['id','day', 'value'])
What I came up with is:
# Create all combinations of id and day
ids= df.select('id').distinct()
days = df.select('day').distinct()
full = ids.crossJoin(days)
# Add combinations back to df filling value with zeros
df_full = df.join(full, ['id', 'day'], 'rightouter') \
    .na.fill(value=0, subset=['value'])
Which outputs what I need:
>>> df_full.orderBy(['id','day']).show()
+---+----------+-----+
| id| day|value|
+---+----------+-----+
| 1|2020-04-01| 5|
| 1|2020-04-02| 0|
| 2|2020-04-01| 5|
| 2|2020-04-02| 0|
| 3|2020-04-01| 0|
| 3|2020-04-02| 4|
+---+----------+-----+
The problem is that both of these operations are computationally very expensive. When I run it on my full data, it produces a job an order of magnitude larger than jobs that usually take a few hours to run.

Is there a more efficient way to do this? Or am I missing something?
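One low-risk tweak worth trying before restructuring the job, sketched here under the assumption that the set of distinct days is small enough to broadcast: cache df so the two distinct() scans and the join do not each re-read the source, and broadcast the tiny days frame in the cross join.

from pyspark.sql.functions import broadcast

# df is scanned three times (two distinct()s plus the join); caching avoids
# re-reading the source each time
df = df.cache()
ids = df.select('id').distinct()
days = df.select('day').distinct()
# broadcasting the tiny days frame keeps the cross join a cheap
# broadcast nested-loop join instead of a shuffle
full = ids.crossJoin(broadcast(days))
# a left join from the full combination set is equivalent to the
# right outer join above
df_full = full.join(df, ['id', 'day'], 'left').na.fill(0, subset=['value'])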
This is how I would implement it. Just one point: the stacked columns must have the same schema (hence the cast to bigint below), otherwise the stack function will raise an error.
import pyspark.sql.functions as f
# Simplified version of my data frame
data = [("1", "2020-04-01", 5),
("2", "2020-04-01", 5),
("3", "2020-04-02", 4)]
df = spark.createDataFrame(data, ['id', 'day', 'value'])
# Creating a dataframe with all distinct days
df_days = df.select(f.col('day').alias('r_day')).distinct()
# Self Join to find all combinations
df_final = df.join(df_days, on=df['day'] != df_days['r_day'])
# +---+----------+-----+----------+
# | id| day|value| r_day|
# +---+----------+-----+----------+
# | 1|2020-04-01| 5|2020-04-02|
# | 2|2020-04-01| 5|2020-04-02|
# | 3|2020-04-02| 4|2020-04-01|
# +---+----------+-----+----------+
# Unpivot dataframe
df_final = df_final.select('id', f.expr('stack(2, day, value, r_day, cast(0 as bigint)) as (day, value)'))
df_final.orderBy('id', 'day').show()
Output:
+---+----------+-----+
| id| day|value|
+---+----------+-----+
| 1|2020-04-01| 5|
| 1|2020-04-02| 0|
| 2|2020-04-01| 5|
| 2|2020-04-02| 0|
| 3|2020-04-01| 0|
| 3|2020-04-02| 4|
+---+----------+-----+
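One caveat, added here as a hedged extension rather than part of the original answer: with more than two distinct days, the inequality join yields one row per (row, other-day) pair, so each real (day, value) pair gets stacked once per extra day and can collide with zero rows for the same (id, day). Assuming values are non-negative, aggregating after the stack collapses those duplicates:

# Hedged sketch: collapse duplicates from the inequality join; max() keeps
# the real value over the zero filler (assumes values are non-negative)
df_final = df_final.groupBy('id', 'day').agg(f.max('value').alias('value'))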
Something like this. I kept the first-row case separate because it makes it clearer what is happening; you could fold it into the "main loop", though.
from datetime import date
from typing import Callable, Iterator, List

from pyspark.sql import Row

data = [
    ("1", date(2020, 4, 1), 5),
    ("2", date(2020, 4, 2), 5),
    ("3", date(2020, 4, 3), 5),
    ("1", date(2020, 4, 3), 5),
]
df = spark.createDataFrame(data, ["id", "date", "value"])

# collect the distinct dates to the driver
row_dates = df.select("date").distinct().collect()
dates = [item.asDict()["date"] for item in row_dates]
def map_row(dates: List[date]) -> Callable[[Iterator[Row]], Iterator[Row]]:
    dates.sort()

    def inner(partition):
        last_row = None
        for row in partition:
            if last_row is None:
                # fill in missing dates for the first row in the partition
                for day in dates:
                    if day < row.date:
                        yield Row(row.id, day, 0)
                    else:
                        # set current row as last row, yield current row
                        # and break out of the loop
                        last_row = row
                        yield row
                        break
            else:
                if last_row.id == row.id:
                    # current row has the same id as the last row:
                    # yield zero rows for the dates between them
                    for day in dates:
                        if day > last_row.date and day < row.date:
                            yield Row(row.id, day, 0)
                    # set current as last and yield current
                    last_row = row
                    yield row
                else:
                    # current row starts a new id
                    for day in dates:
                        # yield any remaining dates for last_row.id
                        if day > last_row.date:
                            yield Row(last_row.id, day, 0)
                    for day in dates:
                        # fill in missing dates before row.date
                        if day < row.date:
                            yield Row(row.id, day, 0)
                        else:
                            # and so on, same pattern as the first-row case
                            last_row = row
                            yield row
                            break
        # after the loop, yield any dates later than the final row's date,
        # which the per-row handling above cannot see
        if last_row is not None:
            for day in dates:
                if day > last_row.date:
                    yield Row(last_row.id, day, 0)

    return inner
rdd = (
    # partitioning by id keeps all rows of an id together; a partition count
    # above 1 would work too, since inner() handles several ids per partition
    df.repartition(1, "id")
    .sortWithinPartitions("id", "date")
    .rdd.mapPartitions(map_row(dates))
)
# pass the original schema explicitly: the zero rows are positional Rows,
# so schema inference could otherwise pick up the wrong column names
new_df = spark.createDataFrame(rdd, df.schema)
new_df.show(10, False)
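A quick sanity check on the result (a sketch, not part of the original answer): after the fill, every id should have exactly one row per distinct date.

# every id should end up with one row per distinct date
assert new_df.count() == df.select("id").distinct().count() * len(dates)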