PySpark: adding a new column from another PySpark dataframe

Question

I have the two dataframes shown below. I want to add a new column to dataframe df_a, taken from column val_1 of dataframe df_b, based on the condition df_a.col_p == df_b.id.

df_a = sqlContext.createDataFrame(
    [(1412, 31, 1), (2422, 21, 1), (4223, 22, 2), (2244, 43, 1),
     (1235, 54, 1), (4126, 12, 5), (2342, 44, 1)],
    ["idx", "col_n", "col_p"])
df_a.show()

+----+-----+-----+
| idx|col_n|col_p|
+----+-----+-----+
|1412|   31|    1|
|2422|   21|    1|
|4223|   22|    2|
|2244|   43|    1|
|1235|   54|    1|
|4126|   12|    5|
|2342|   44|    1|
+----+-----+-----+

df_b = sqlContext.createDataFrame(
    [(1, 1, 1), (2, 1, 1), (3, 1, 2), (4, 1, 1), (5, 2, 1), (6, 2, 2)],
    ["id", "val_1", "val_2"])
df_b.show()

+---+-----+-----+
| id|val_1|val_2|
+---+-----+-----+
|  1|    1|    1|
|  2|    1|    1|
|  3|    1|    2|
|  4|    1|    1|
|  5|    2|    1|
|  6|    2|    2|
+---+-----+-----+

Expected output

+----+-----+-----+-----+
| idx|col_n|col_p|val_1|
+----+-----+-----+-----+
|1412|   31|    1|    1|
|2422|   21|    1|    1|
|4223|   22|    2|    1|
|2244|   43|    1|    1|
|1235|   54|    1|    1|
|4126|   12|    5|    2|
|2342|   44|    1|    1|
+----+-----+-----+-----+

My code

import pyspark.sql.functions as F

cond = (df_a.col_p == df_b.id)
df_a_new = df_a.join(df_b, cond, how='full').withColumn('val_new', F.when(cond, df_b.val_1))
df_a_new = df_a_new.drop(*['id', 'val_1', 'val_2'])
df_a_new = df_a_new.filter(df_a_new.idx.isNotNull())
df_a_new.show()

How can I get the correct output, with the rows in their original order, as in the expected result?

Answer 1

You can assign df_a an increasing index and sort by that index after the join. I would also suggest a left join rather than a full join.

import pyspark.sql.functions as F

df_a_new = df_a.withColumn('index', F.monotonically_increasing_id()) \
               .join(df_b, df_a.col_p == df_b.id, 'left') \
               .orderBy('index') \
               .select('idx', 'col_n', 'col_p', 'val_1')

df_a_new.show()
+----+-----+-----+-----+
| idx|col_n|col_p|val_1|
+----+-----+-----+-----+
|1412|   31|    1|    1|
|2422|   21|    1|    1|
|4223|   22|    2|    1|
|2244|   43|    1|    1|
|1235|   54|    1|    1|
|4126|   12|    5|    2|
|2342|   44|    1|    1|
+----+-----+-----+-----+

Answer 2

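You need to create your own index (for example with monotonically_increasing_id) and sort on it again after the join. You cannot join while preserving order in Spark, because the rows are partitioned before the join and lose their order before they are combined. See: Can Dataframe joins in Spark preserve order?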


Answer 1: 3, accepted, 2021-01-04 06:36:46

Answer 2: 1, 2021-01-04 06:38:54
