Let's say I have 2 Spark dataframes:
Location  Date      Date_part  Sector    units
USA       7/1/2021  7/1/2021   Cars      200
IND       7/1/2021  7/1/2021   Scooters  180
COL       7/1/2021  7/1/2021   Trucks    100

Location  Date  Brands  units  values
UK        null  brand1  400    120
AUS       null  brand2  450    230
CAN       null  brand3  150    34
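(For reproducibility, the two dataframes can be created like this; the schemas below are my assumption, with dates kept as strings:)

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Assumed schemas, inferred from the tables above
df1 = spark.createDataFrame(
    [("USA", "7/1/2021", "7/1/2021", "Cars", 200),
     ("IND", "7/1/2021", "7/1/2021", "Scooters", 180),
     ("COL", "7/1/2021", "7/1/2021", "Trucks", 100)],
    "Location string, Date string, Date_part string, Sector string, units int",
)

df2 = spark.createDataFrame(
    [("UK", None, "brand1", 400, 120),
     ("AUS", None, "brand2", 450, 230),
     ("CAN", None, "brand3", 150, 34)],
    "Location string, Date string, Brands string, units int, values int",
)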
I need my resultant dataframe to be:
Location  Date      Date_part  Sector    Brands  units  values
USA       7/1/2021  7/1/2021   Cars      null    200    null
IND       7/1/2021  7/1/2021   Scooters  null    180    null
COL       7/1/2021  7/1/2021   Trucks    null    100    null
UK        null      7/1/2021   null      brand1  400    120
AUS       null      7/1/2021   null      brand2  450    230
CAN       null      7/1/2021   null      brand3  150    34
So my desired df should contain all columns from both dataframes, and I also need Date_part in all rows. This is what I tried:
df_result = df1.union(df2)
But I'm getting this as my result. The values are being swapped, and one column from the second dataframe is missing.
Location  Date      Date_part  Sector    units
USA       7/1/2021  7/1/2021   Cars      200
IND       7/1/2021  7/1/2021   Scooters  180
COL       7/1/2021  7/1/2021   Trucks    100
UK        null      brand1     400       120
AUS       null      brand2     450       230
CAN       null      brand3     150       34
Any suggestions, please?
union: this function resolves columns by position (not by name). That is why the values appear swapped and why one column from the second dataframe seems to be missing.

You should use unionByName instead, but this function requires both dataframes to have the same set of columns.
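(Calling df1.unionByName(df2) directly on the example dataframes would therefore fail, since each side is missing columns that exist on the other:)

# Fails: df2 has no Date_part/Sector, df1 has no Brands/values
df1.unionByName(df2)  # raises AnalysisException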
Here is some simple code to harmonize the structure of your dataframes before doing the unionByName:
from pyspark.sql import DataFrame
from pyspark.sql import functions as F


def add_missing_columns(df: DataFrame, ref_df: DataFrame) -> DataFrame:
    """Add missing columns from ref_df to df.

    Args:
        df (DataFrame): dataframe with missing columns
        ref_df (DataFrame): referential dataframe

    Returns:
        DataFrame: df with additional columns from ref_df
    """
    # Add each column that exists in ref_df but not in df as a typed null column
    for col in ref_df.schema:
        if col.name not in df.columns:
            df = df.withColumn(col.name, F.lit(None).cast(col.dataType))
    return df
df1 = add_missing_columns(df1, df2)
df2 = add_missing_columns(df2, df1)
df_result = df1.unionByName(df2)
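As a quick sanity check (a sketch against the example data), the result should now carry all seven columns:

# Both frames were harmonized to the same column set,
# so the union keeps every column from both inputs
assert set(df_result.columns) == set(df1.columns) | set(df2.columns)
df_result.show()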
This is an add-on to @Steven's response (since I don't have enough reputation to comment directly under his post):
Apart from the optional argument suggested by @minus34 for Spark 3.1+, @Steven's solution (add_missing_columns) is a perfect workaround. However, each call to withColumn introduces a projection internally; when called in a large loop this generates big query plans that can cause performance issues and eventually a StackOverflowError for wide datasets.
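For reference, on Spark 3.1+ that optional argument lets you skip the helper entirely:

# Spark 3.1+ only: unionByName fills in missing columns with nulls itself
df_result = df1.unionByName(df2, allowMissingColumns=True)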
For earlier Spark versions, a scalable modification of @Steven's code could be:
from pyspark.sql import DataFrame
from pyspark.sql import functions as F
from pyspark.sql import types as T


def add_missing_columns(df: DataFrame, ref_df: DataFrame) -> DataFrame:
    """Add missing columns from ref_df to df.

    Args:
        df (DataFrame): dataframe with missing columns
        ref_df (DataFrame): referential dataframe

    Returns:
        DataFrame: df with additional columns from ref_df
    """
    # Collect the names of the columns that df is missing
    missing_col = []
    for col in ref_df.schema:
        if col.name not in df.columns:
            missing_col.append(col.name)

    # Add them all in a single select, so only one projection is introduced
    df = df.select(['*'] + [F.lit(None).cast(T.NullType()).alias(c) for c in missing_col])
    return df
select is therefore a possible alternative, and it might be better to cast the new empty columns of value None to NullType(), since you then needn't specify the specific data type to cast each empty column to. (NullType() works fine in union and unionByName with any data type in Spark.)
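A quick sketch illustrating that coercion, reusing the example dataframes from the question:

df_a = add_missing_columns(df1, df2)  # gains Brands, values as NullType columns
df_b = add_missing_columns(df2, df1)  # gains Date_part, Sector as NullType columns
df_result = df_a.unionByName(df_b)
# In the union result the NullType columns are coerced to the concrete
# type from the other side: Brands/values become string/int (from df2),
# Date_part/Sector become string (from df1)
df_result.printSchema()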