
Compare column names in two PySpark data frames

I have two data frames in PySpark, df and data. The schemas are as below:

>>> df.printSchema()
root
 |-- id: integer (nullable = false)
 |-- name: string (nullable = true)
 |-- address: string (nullable = true)
 |-- nation: string (nullable = true)
 |-- Date: timestamp (nullable = false)
 |-- ZipCode: integer (nullable = true)
 |-- car: string (nullable = true)
 |-- van: string (nullable = true)

>>> data.printSchema()
root 
 |-- id: integer (nullable = true)
 |-- name: string (nullable = true)
 |-- address: string (nullable = true)
 |-- nation: string (nullable = true)
 |-- date: string (nullable = true)
 |-- zipcode: integer (nullable = true)

Now I want to add the columns car and van to my data data frame by comparing the two schemas.

More generally, I want to compare the two data frames: if the columns are the same, do nothing; if the columns differ, add the missing columns to the data frame that doesn't have them.

How can we achieve that in PySpark?

FYI, I am using Spark 1.6.

Once the columns are added, the values for those columns in the newly extended data frame should be null.

For example, here we are adding columns to the data data frame, so the columns car and van in data should contain null values, while the same columns in df should keep their original values.
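For a single column, I could do something like the below, but I don't know how to generalize it by comparing the two schemas:

from pyspark.sql.functions import lit

# add car to data so that every existing row gets a null value for it
data = data.withColumn("car", lit(None).cast("string"))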

What happens if there are more than two new columns to be added?

As the schema is nothing but a StructType consisting of a list of StructFields, we can retrieve the fields list to compare the two schemas and find the missing columns:

from pyspark.sql.functions import lit

df_schema = df.schema.fields
data_schema = data.schema.fields
df_names = [x.name.lower() for x in df_schema]
data_names = [x.name.lower() for x in data_schema]

if set(df_names) != set(data_names):
    # columns present in exactly one of the two data frames
    col_diff = set(df_names) ^ set(data_names)
    # (name, dataType) pairs for each missing column, taken from
    # whichever schema actually contains the field
    col_list = [(f.name, f.dataType) for f in df_schema + data_schema
                if f.name.lower() in col_diff]
    for name, dtype in col_list:
        if name.lower() in df_names:
            # column exists only in df, so add it to data as all-null
            data = data.withColumn(name, lit(None).cast(dtype))
        else:
            # column exists only in data, so add it to df as all-null
            df = df.withColumn(name, lit(None).cast(dtype))
else:
    print "Nothing to do"
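With the schemas from the question, after the loop runs, data should report car and van as all-null string columns appended at the end:

>>> data.printSchema()
root
 |-- id: integer (nullable = true)
 |-- name: string (nullable = true)
 |-- address: string (nullable = true)
 |-- nation: string (nullable = true)
 |-- date: string (nullable = true)
 |-- zipcode: integer (nullable = true)
 |-- car: string (nullable = true)
 |-- van: string (nullable = true)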

You have mentioned adding the column only if there are no null values, but your schema differences are nullable columns, so I have not used that check. If you need it, add a check on the field's nullable flag as below:

col_list = [(f.name, f.dataType) for f in df_schema + data_schema
            if f.name.lower() in col_diff and not f.nullable]

Please check the documentation for more about StructType and StructField: https://spark.apache.org/docs/1.6.2/api/python/pyspark.sql.html#pyspark.sql.types.StructType
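For reference, a minimal sketch of what those types expose (only the documented attributes are used here):

from pyspark.sql.types import StructType, StructField, StringType

# a schema is a StructType wrapping a list of StructFields;
# each field exposes name, dataType and nullable
schema = StructType([StructField("car", StringType(), True)])
field = schema.fields[0]
print field.name, field.dataType, field.nullable   # car StringType True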

If you have to do this for multiple tables, it might be worth generalizing the code a bit. Note that, unlike the approach above, this code takes the first non-null value in the non-matching source column and uses it to populate the new column in the target table.

from pyspark.sql.functions import lit, first

def first_non_null(f, t):
    # find the first non-null value of column t in data frame f
    return f.select(first(f[t], ignorenulls=True)).first()[0]

def match_type(f1, f2, miss):
    # add each missing column to whichever data frame lacks it,
    # filled with the first non-null value from the other frame
    for i in miss:
        if i not in f1.columns:
            f1 = f1.withColumn(i, lit(first_non_null(f2, i)))
        else:
            f2 = f2.withColumn(i, lit(first_non_null(f1, i)))
    return f1, f2

def column_sync_up(d1, d2):
    # test whether the matching requirement is already met
    missing = list(set(d1.columns) ^ set(d2.columns))
    if len(missing) > 0:
        return match_type(d1, d2, missing)
    else:
        print "Columns Match!"
        return d1, d2  # return the frames unchanged so unpacking still works

df1, df2 = column_sync_up(df1, df2)  # reuse as necessary
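Applied to the data frames from the question, the call would be:

df, data = column_sync_up(df, data)

Note that with this variant, car and van in data are filled with the first non-null values taken from df rather than with nulls; if you need nulls, use the lit(None) approach from the first answer.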
