I have two data frames in pyspark df
and data
. The schema are like below
>>> df.printSchema()
root
|-- id: integer (nullable = false)
|-- name: string (nullable = true)
|-- address: string (nullable = true)
|-- nation: string (nullable = true)
|-- Date: timestamp (nullable = false)
|-- ZipCode: integer (nullable = true)
|-- car: string (nullable = true)
|-- van: string (nullable = true)
>>> data.printSchema()
root
|-- id: integer (nullable = true)
|-- name: string (nullable = true)
|-- address: string (nullable = true)
|-- nation: string (nullable = true)
|-- date: string (nullable = true)
|-- zipcode: integer (nullable = true)
Now I want to add columns car and van to my data
data frame by comparing both the schema.
I would also want to compares two data frames if the columns are same do nothing, but if the columns are different then add the columns to the data frame that doesn't have the columns.
How can we achieve that in pyspark.
FYI I am using spark 1.6
once the columns are added to the data frame. The values for those columns in the newly added data frame Should be null.
for example here we are adding columns to
data
data frame so the columns car and van in data data frame should contain null values but the same columns in df data frame should have their original valueswhat happens if there are more than 2 new columns to be added
As the schema is not but StructType consisting of list of StructFields, we can retrieve the fields list, to compare and find the missing columns,
df_schema = df.schema.fields
data_schema = data.schema.fields
df_names = [x.name.lower() for x in df_scehma]
data_names = [x.name.lower() for x in data_schema]
if df_schema <> data_schema:
col_diff = set(df_names) ^ set(data_names)
col_list = [(x[0].name,x[0].dataType) for x in map(None,df_schema,data_schema) if ((x[0] is not None and x[0].name.lower() in col_diff) or x[1].name.lower() in col_diff)]
for i in col_list:
if i[0] in df_names:
data = data.withColumn("%s"%i[0],lit(None).cast(i[1]))
else:
df = df.withColumn("%s"%i[0],lit(None).cast(i[1]))
else:
print "Nothing to do"
You have mentioned to add the column if there is no null values, but your schema diference are nullable columns, so have not used that check. If you need it, then add check for nullable as below,
col_list = [(x[0].name,x[0].dataType) for x in map(None,df_schema,data_schema) if (x[0].name.lower() in col_diff or x[1].name.lower() in col_diff) and not x.nullable]
Please check the documentation for more about StructType and StructFields, https://spark.apache.org/docs/1.6.2/api/python/pyspark.sql.html#pyspark.sql.types.StructType
If you have to do this to multiple tables, it might be worth it to generalize the code a bit. This code takes the first non-null value in the non-matching source column to create the new column in the target table.
from pyspark.sql.functions import lit, first
def first_non_null(f,t): # find the first non-null value of a column
return f.select(first(f[t], ignorenulls=True)).first()[0]
def match_type(f1,f2,miss): # add missing column to the target table
for i in miss:
try:
f1 = f1.withColumn(i, lit(first_non_null(f2,i)))
except:
pass
try:
f2 = f2.withColumn(i, lit(first_non_null(f1,i)))
except:
pass
return f1, f2
def column_sync_up(d1,d2): # test if the matching requirement is met
missing = list(set(d1.columns) ^ set(d2.columns))
if len(missing)>0:
return match_type(d1,d2,missing)
else:
print "Columns Match!"
df1, df2 = column_sync_up(df1,df2) # reuse as necessary
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.