Merge two spark dataframes with different columns to get all columns

Let's say I have 2 Spark dataframes:

df1:
Location    Date        Date_part   Sector      units
USA         7/1/2021    7/1/2021    Cars        200
IND         7/1/2021    7/1/2021    Scooters    180
COL         7/1/2021    7/1/2021    Trucks      100

df2:
Location    Date        Brands      units       values
UK          null        brand1      400         120
AUS         null        brand2      450         230
CAN         null        brand3      150         34

I need my resultant dataframe to be:

Location    Date        Date_part   Sector      Brands  units   values
USA         7/1/2021    7/1/2021    Cars                200     
IND         7/1/2021    7/1/2021    Scooters            180     
COL         7/1/2021    7/1/2021    Trucks              100
UK          null        7/1/2021                brand1  400     120
AUS         null        7/1/2021                brand2  450     230
CAN         null        7/1/2021                brand3  150     34

So my desired df should contain all columns from both dataframes, and I also need Date_part populated in all rows. This is what I tried:

df_result = df1.union(df2)

But I'm getting this as my result. The values are being swapped, and one column from the second dataframe is missing.

Location    Date        Date_part   Sector      Brands  units
USA         7/1/2021    7/1/2021    Cars        200     
IND         7/1/2021    7/1/2021    Scooters    180     
COL         7/1/2021    7/1/2021    Trucks      100
UK          null        brand1                  400     120
AUS         null        brand2                  450     230
CAN         null        brand3                  150     34

Any suggestions please?

union: this function resolves columns by position (not by name).

That is the reason why you observed "The values are being swapped and one column from second dataframe is missing."
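
A minimal sketch of that behaviour (assuming an active SparkSession named spark; toy two-column frames for brevity):

df_a = spark.createDataFrame([("USA", "Cars")], ["Location", "Sector"])
df_b = spark.createDataFrame([("UK", "brand1")], ["Location", "Brands"])

# union matches by position: df_b's Brands values land in df_a's Sector column
df_a.union(df_b).show()
# +--------+------+
# |Location|Sector|
# +--------+------+
# |     USA|  Cars|
# |      UK|brand1|
# +--------+------+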

You should use unionByName, but this function requires both dataframes to have the same structure.

I offer you this simple piece of code to harmonize the structure of your dataframes and then do the union(ByName).

from pyspark.sql import DataFrame
from pyspark.sql import functions as F

def add_missing_columns(df: DataFrame, ref_df: DataFrame) -> DataFrame:
    """Add missing columns from ref_df to df

    Args:
        df (DataFrame): dataframe with missing columns
        ref_df (DataFrame): referential dataframe

    Returns:
        DataFrame: df with additional columns from ref_df
    """
    for col in ref_df.schema:
        if col.name not in df.columns:
            df = df.withColumn(col.name, F.lit(None).cast(col.dataType))

    return df


df1 = add_missing_columns(df1, df2)
df2 = add_missing_columns(df2, df1)

df_result = df1.unionByName(df2)
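
For illustration, a minimal end-to-end sketch on one-row stand-ins for the question's dataframes (explicit DDL schemas are used because the all-null Date column would defeat type inference):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# One-row stand-ins for the question's two dataframes
df1 = spark.createDataFrame(
    [("USA", "7/1/2021", "7/1/2021", "Cars", 200)],
    "Location STRING, Date STRING, Date_part STRING, Sector STRING, units LONG",
)
df2 = spark.createDataFrame(
    [("UK", None, "brand1", 400, 120)],
    "Location STRING, Date STRING, Brands STRING, units LONG, `values` LONG",
)

df1 = add_missing_columns(df1, df2)   # adds Brands, values as typed nulls
df2 = add_missing_columns(df2, df1)   # adds Date_part, Sector as typed nulls
df1.unionByName(df2).show()
# +--------+--------+---------+------+-----+------+------+
# |Location|    Date|Date_part|Sector|units|Brands|values|
# +--------+--------+---------+------+-----+------+------+
# |     USA|7/1/2021| 7/1/2021|  Cars|  200|  null|  null|
# |      UK|    null|     null|  null|  400|brand1|   120|
# +--------+--------+---------+------+-----+------+------+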

This is an add-on to @Steven's response (since I don't have enough reputation to comment directly under his post):

Apart from the optional argument suggested by @minus34 for Spark 3.1+ (shown below), @Steven's solution (add_missing_columns) is a perfect workaround. However, calling withColumn introduces a projection internally, which, when called in a large loop, generates big plans that can potentially cause performance issues, eventually amounting to a StackOverflowError for large datasets.
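
For reference, that optional argument is allowMissingColumns, available on unionByName since Spark 3.1; it fills missing columns on either side with nulls, so no helper is needed:

# Spark 3.1+ only: columns missing from either dataframe are filled with nulls
df_result = df1.unionByName(df2, allowMissingColumns=True)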

A scalable modification of @Steven's code could be:

from pyspark.sql import DataFrame
from pyspark.sql import functions as F
from pyspark.sql import types as T

def add_missing_columns(df: DataFrame, ref_df: DataFrame) -> DataFrame:
    """Add missing columns from ref_df to df

    Args:
        df (DataFrame): dataframe with missing columns
        ref_df (DataFrame): referential dataframe

    Returns:
        DataFrame: df with additional columns from ref_df
    """
    missing_col = []
    for col in ref_df.schema:
        if col.name not in df.columns:
            missing_col.append(col.name)
            
    df = df.select(['*'] + [F.lit(None).cast(T.NullType()).alias(c) for c in missing_col])

    return df

select is therefore a possible alternative, and it might be better to cast the new empty columns of value None to NullType(), as you then needn't specify a specific data type for each empty column. (NullType() works fine in union and unionByName with any data type in Spark.)
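
As a quick sanity check of that claim, a small hypothetical example (assuming an active SparkSession named spark): a NullType column resolves cleanly against a typed column in unionByName:

a = spark.createDataFrame([(1,)], "x LONG").withColumn("y", F.lit(None).cast(T.NullType()))
b = spark.createDataFrame([(2, "ok")], "x LONG, y STRING")

# Spark widens NullType to the other side's type during union resolution
a.unionByName(b).printSchema()
# root
#  |-- x: long (nullable = true)
#  |-- y: string (nullable = true)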
