Better way to add column values combinations to data frame in PySpark
I have a dataset with 3 columns: id, day, and value. I need to add rows with value set to zero for all combinations of id and day.
# Simplified version of my data frame
data = [("1", "2020-04-01", 5),
("2", "2020-04-01", 5),
("3", "2020-04-02", 4)]
df = spark.createDataFrame(data,['id','day', 'value'])
What I came up with is:
# Create all combinations of id and day
ids= df.select('id').distinct()
days = df.select('day').distinct()
full = ids.crossJoin(days)
# Add combinations back to df filling value with zeros
df_full = df.join(full, ['id', 'day'], 'rightouter') \
    .na.fill(value=0, subset=['value'])
Which outputs what I need:
>>> df_full.orderBy(['id','day']).show()
+---+----------+-----+
| id| day|value|
+---+----------+-----+
| 1|2020-04-01| 5|
| 1|2020-04-02| 0|
| 2|2020-04-01| 5|
| 2|2020-04-02| 0|
| 3|2020-04-01| 0|
| 3|2020-04-02| 4|
+---+----------+-----+
The problem is that both of these operations are computationally very expensive. When I run it on my full data, it produces a job an order of magnitude larger than jobs that usually take a few hours to run.

Is there a more efficient way to do this? Or am I missing something?
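One low-risk tweak worth trying before restructuring the job, sketched here under the assumption that the set of distinct days is small enough to broadcast: cache df so the two distinct() scans and the join do not each re-read the source, and broadcast the tiny days frame in the cross join.

from pyspark.sql.functions import broadcast

# df is scanned three times (two distinct()s plus the join); caching avoids
# re-reading the source each time
df = df.cache()
ids = df.select('id').distinct()
days = df.select('day').distinct()
# broadcasting the tiny days frame keeps the cross join a cheap
# broadcast nested-loop join instead of a shuffle
full = ids.crossJoin(broadcast(days))
# a left join from the full combination set is equivalent to the
# right outer join above
df_full = full.join(df, ['id', 'day'], 'left').na.fill(0, subset=['value'])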
This is how I would implement it. Just one point: the stacked columns must have the same schema (hence the cast to bigint below), otherwise the stack function will raise an error.
import pyspark.sql.functions as f
# Simplified version of my data frame
data = [("1", "2020-04-01", 5),
("2", "2020-04-01", 5),
("3", "2020-04-02", 4)]
df = spark.createDataFrame(data, ['id', 'day', 'value'])
# Creating a dataframe with all distinct days
df_days = df.select(f.col('day').alias('r_day')).distinct()
# Self Join to find all combinations
df_final = df.join(df_days, on=df['day'] != df_days['r_day'])
# +---+----------+-----+----------+
# | id| day|value| r_day|
# +---+----------+-----+----------+
# | 1|2020-04-01| 5|2020-04-02|
# | 2|2020-04-01| 5|2020-04-02|
# | 3|2020-04-02| 4|2020-04-01|
# +---+----------+-----+----------+
# Unpivot dataframe
df_final = df_final.select('id', f.expr('stack(2, day, value, r_day, cast(0 as bigint)) as (day, value)'))
df_final.orderBy('id', 'day').show()
Output:
+---+----------+-----+
| id| day|value|
+---+----------+-----+
| 1|2020-04-01| 5|
| 1|2020-04-02| 0|
| 2|2020-04-01| 5|
| 2|2020-04-02| 0|
| 3|2020-04-01| 0|
| 3|2020-04-02| 4|
+---+----------+-----+
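One caveat, added here as a hedged extension rather than part of the original answer: with more than two distinct days, the inequality join yields one row per (row, other-day) pair, so each real (day, value) pair gets stacked once per extra day and can collide with zero rows for the same (id, day). Assuming values are non-negative, aggregating after the stack collapses those duplicates:

# Hedged sketch: collapse duplicates from the inequality join; max() keeps
# the real value over the zero filler (assumes values are non-negative)
df_final = df_final.groupBy('id', 'day').agg(f.max('value').alias('value'))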
Something like this. I kept the first-row case separate because it makes it clearer what is happening; you could fold it into the "main loop", though.
from datetime import date
from typing import Callable, Iterator, List

from pyspark.sql import Row

data = [
    ("1", date(2020, 4, 1), 5),
    ("2", date(2020, 4, 2), 5),
    ("3", date(2020, 4, 3), 5),
    ("1", date(2020, 4, 3), 5),
]
df = spark.createDataFrame(data, ["id", "date", "value"])

# collect the distinct dates to the driver
row_dates = df.select("date").distinct().collect()
dates = [item.asDict()["date"] for item in row_dates]
def map_row(dates: List[date]) -> Callable[[Iterator[Row]], Iterator[Row]]:
    dates.sort()

    def inner(partition):
        last_row = None
        for row in partition:
            if last_row is None:
                # fill in missing dates for the first row in the partition
                for day in dates:
                    if day < row.date:
                        yield Row(row.id, day, 0)
                    else:
                        # set current row as last row, yield current row
                        # and break out of the loop
                        last_row = row
                        yield row
                        break
            else:
                if last_row.id == row.id:
                    # current row has the same id as the last row:
                    # yield zero rows for the dates between them
                    for day in dates:
                        if day > last_row.date and day < row.date:
                            yield Row(row.id, day, 0)
                    # set current as last and yield current
                    last_row = row
                    yield row
                else:
                    # current row starts a new id
                    for day in dates:
                        # yield any remaining dates for last_row.id
                        if day > last_row.date:
                            yield Row(last_row.id, day, 0)
                    for day in dates:
                        # fill in missing dates before row.date
                        if day < row.date:
                            yield Row(row.id, day, 0)
                        else:
                            # and so on, same pattern as the first-row case
                            last_row = row
                            yield row
                            break
        # after the loop, yield any dates later than the final row's date,
        # which the per-row handling above cannot see
        if last_row is not None:
            for day in dates:
                if day > last_row.date:
                    yield Row(last_row.id, day, 0)

    return inner
rdd = (
    # partitioning by id keeps all rows of an id together; a partition count
    # above 1 would work too, since inner() handles several ids per partition
    df.repartition(1, "id")
    .sortWithinPartitions("id", "date")
    .rdd.mapPartitions(map_row(dates))
)
# pass the original schema explicitly: the zero rows are positional Rows,
# so schema inference could otherwise pick up the wrong column names
new_df = spark.createDataFrame(rdd, df.schema)
new_df.show(10, False)
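A quick sanity check on the result (a sketch, not part of the original answer): after the fill, every id should have exactly one row per distinct date.

# every id should end up with one row per distinct date
assert new_df.count() == df.select("id").distinct().count() * len(dates)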