Compare column names in two data frames in PySpark

I have two data frames in PySpark, df and data. Their schemas are shown below.

>>> df.printSchema()
root
 |-- id: integer (nullable = false)
 |-- name: string (nullable = true)
 |-- address: string (nullable = true)
 |-- nation: string (nullable = true)
 |-- Date: timestamp (nullable = false)
 |-- ZipCode: integer (nullable = true)
 |-- car: string (nullable = true)
 |-- van: string (nullable = true)

>>> data.printSchema()
root 
 |-- id: integer (nullable = true)
 |-- name: string (nullable = true)
 |-- address: string (nullable = true)
 |-- nation: string (nullable = true)
 |-- date: string (nullable = true)
 |-- zipcode: integer (nullable = true)

Now I want to add the columns car and van to my data data frame by comparing the two schemas.

More generally, I want to compare two data frames: if their columns are the same, do nothing, but if they differ, add the missing columns to whichever data frame lacks them.

How can we achieve that in PySpark?

FYI, I am using Spark 1.6.

Once the columns are added, their values in the data frame that received them should be null.

For example, here we are adding columns to the data data frame, so car and van in data should contain null values, while the same columns in df should keep their original values.

What happens if there are more than two new columns to add?

Since the schema is nothing but a StructType consisting of a list of StructFields, we can retrieve the list of fields, compare the two lists, and find the missing columns:

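from pyspark.sql.functions import lit

df_schema = df.schema.fields
data_schema = data.schema.fields
# Lower-case the names so that Date/date and ZipCode/zipcode count as the same column
df_names = [x.name.lower() for x in df_schema]
data_names = [x.name.lower() for x in data_schema]
# Names that appear in exactly one of the two data frames
col_diff = set(df_names) ^ set(data_names)
if col_diff:
    # (name, type) of every field that the other data frame lacks
    col_list = [(x.name, x.dataType) for x in df_schema + data_schema if x.name.lower() in col_diff]
    for name, dtype in col_list:
        if name.lower() in df_names:
            # The column exists only in df, so add it to data as a typed null
            data = data.withColumn(name, lit(None).cast(dtype))
        else:
            # The column exists only in data, so add it to df as a typed null
            df = df.withColumn(name, lit(None).cast(dtype))
else:
    print("Nothing to do")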
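The name comparison does not need a Spark session, so it can be checked in isolation. Below is a minimal sketch in plain Python, assuming the schemas are given as lists of (name, type string) pairs, which is the shape of DataFrame.dtypes. The helper name columns_to_add is hypothetical, and the two lists are typed in by hand from the schemas in the question.

```python
def columns_to_add(left_dtypes, right_dtypes):
    """Return (add_to_left, add_to_right) as lists of (name, type) pairs.

    Both inputs are lists of (column name, type string) pairs.
    Names are compared case-insensitively; the original spelling is kept.
    """
    left_names = set(name.lower() for name, _ in left_dtypes)
    right_names = set(name.lower() for name, _ in right_dtypes)
    add_to_left = [(n, t) for n, t in right_dtypes if n.lower() not in left_names]
    add_to_right = [(n, t) for n, t in left_dtypes if n.lower() not in right_names]
    return add_to_left, add_to_right


# Hand-written stand-ins for df.dtypes and data.dtypes (assumed values)
df_dtypes = [("id", "int"), ("name", "string"), ("address", "string"),
             ("nation", "string"), ("Date", "timestamp"), ("ZipCode", "int"),
             ("car", "string"), ("van", "string")]
data_dtypes = [("id", "int"), ("name", "string"), ("address", "string"),
               ("nation", "string"), ("date", "string"), ("zipcode", "int")]

add_to_df, add_to_data = columns_to_add(df_dtypes, data_dtypes)
print(add_to_df)    # []
print(add_to_data)  # [('car', 'string'), ('van', 'string')]
```

With real data frames, each (name, dtype) pair in add_to_data could then be applied as data = data.withColumn(name, lit(None).cast(dtype)). The loop does not care how many columns are missing, so it covers the case of more than two new columns.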

You mentioned adding a column only when it has no null values, but the columns that differ between your schemas are all nullable, so I have not used that check. If you need it, add a check on nullable as below:

col_list = [(x.name, x.dataType) for x in df_schema + data_schema if x.name.lower() in col_diff and not x.nullable]

Please check the documentation for more about StructType and StructField: https://spark.apache.org/docs/1.6.2/api/python/pyspark.sql.html#pyspark.sql.types.StructType

If you have to do this for multiple tables, it might be worth generalizing the code a bit. This code takes the first non-null value in the non-matching source column and uses it to create the new column in the target table, so the new column holds that constant value rather than null.

from pyspark.sql.functions import lit, first

def first_non_null(f, t):  # find the first non-null value of column t in data frame f
    # Note: the ignorenulls argument may not be available on Spark 1.6; check your version
    return f.select(first(f[t], ignorenulls=True)).first()[0]

def match_type(f1, f2, miss):  # add each missing column to the data frame that lacks it
    for i in miss:
        if i in f2.columns:
            # Column exists only in f2, so create it in f1
            f1 = f1.withColumn(i, lit(first_non_null(f2, i)))
        else:
            # Column exists only in f1, so create it in f2
            f2 = f2.withColumn(i, lit(first_non_null(f1, i)))
    return f1, f2

def column_sync_up(d1, d2):  # add columns only when the two column sets differ
    missing = list(set(d1.columns) ^ set(d2.columns))
    if missing:
        return match_type(d1, d2, missing)
    print("Columns Match!")
    return d1, d2  # return the inputs unchanged so the caller can always unpack

df1, df2 = column_sync_up(df1, df2)  # reuse as necessary
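For more than two tables, one option is to work out up front which columns each table lacks relative to the union of all columns. The sketch below does only that planning step in plain Python. It assumes each table is described by a list of (name, type string) pairs, as returned by DataFrame.dtypes; the function name plan_missing_columns and the sample tables are hypothetical.

```python
from collections import OrderedDict


def plan_missing_columns(tables):
    """Map each table name to the (name, type) pairs it is missing.

    tables: ordered mapping of table name -> list of (column name, type string).
    Names are compared case-insensitively. The first table that defines a
    column decides the spelling and type used for it everywhere else.
    """
    all_columns = OrderedDict()
    for dtypes in tables.values():
        for name, dtype in dtypes:
            all_columns.setdefault(name.lower(), (name, dtype))

    plan = OrderedDict()
    for table, dtypes in tables.items():
        present = set(name.lower() for name, _ in dtypes)
        plan[table] = [col for key, col in all_columns.items() if key not in present]
    return plan


# Hypothetical stand-ins for df1.dtypes, df2.dtypes and df3.dtypes
tables = OrderedDict([
    ("df1", [("id", "int"), ("car", "string")]),
    ("df2", [("id", "int"), ("van", "string")]),
    ("df3", [("ID", "int")]),
])

plan = plan_missing_columns(tables)
for table, missing in plan.items():
    print(table, missing)
```

Each entry in the plan can then be turned into withColumn calls on the matching data frame, using either a typed null or a fill value as in the functions above.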
