
How to insert record into a dataframe in spark

I have a dataframe (df1) with 50 columns: the first is cust_id and the rest are features. I also have another dataframe (df2) that contains only cust_id. I'd like to add one record per customer in df2 to df1, with all the features set to 0. But since the two dataframes have different schemas, I cannot do a union. What is the best way to do that?

I tried a full outer join, but it generates two cust_id columns and I need just one. I should somehow merge these two cust_id columns, but I don't know how.
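For reference, the duplicate key column appears when the join condition is written as a column equality. In that case the two cust_id columns can be merged with coalesce and the originals dropped. A minimal sketch with toy stand-in data (the two-row dataframes below are assumptions for illustration, not the asker's actual data):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.coalesce

val spark = SparkSession.builder().master("local[1]").appName("merge-key").getOrCreate()
import spark.implicits._

// Toy stand-ins: df1 has cust_id plus one feature, df2 has cust_id only
val df1 = Seq((1, 10), (2, 20)).toDF("cust_id", "f1")
val df2 = Seq(2, 3).toDF("cust_id")

// Joining on an equality expression keeps both cust_id columns...
val joined = df1.join(df2, df1("cust_id") === df2("cust_id"), "full_outer")

// ...which can then be merged with coalesce, dropping the originals
val merged = joined
  .withColumn("merged_id", coalesce(df1("cust_id"), df2("cust_id")))
  .drop(df1("cust_id"))
  .drop(df2("cust_id"))
  .withColumnRenamed("merged_id", "cust_id")
```

Joining on `Seq("cust_id")` instead, as in the answer below, avoids the duplicate column in the first place.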

You can achieve something like that with a full outer join, like the following:

val result = df1.join(df2, Seq("cust_id"), "full_outer")

However, the features will then be null instead of 0. If you really need them to be zero, one way to do it would be:

import org.apache.spark.sql.functions.{col, lit}

// Feature columns: everything except "cust_id"
val features = df1.columns.filter(_ != "cust_id")
val newDF = features.foldLeft(df2)(
  (df, colName) => df.withColumn(colName, lit(0))
)
// union matches columns by position, so align the column order with df1 first
// (unionAll is deprecated since Spark 2.0)
df1.union(newDF.select(df1.columns.map(col): _*))
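Alternatively, the two approaches can be combined: do the full outer join on the column name (which yields a single cust_id column) and then replace the resulting nulls with `DataFrameNaFunctions.fill`. A runnable sketch with toy stand-in data (the small dataframes below are assumptions for illustration):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[1]").appName("join-fill").getOrCreate()
import spark.implicits._

// Toy stand-ins for df1 (cust_id + features) and df2 (cust_id only)
val df1 = Seq((1, 10, 20), (2, 30, 40)).toDF("cust_id", "f1", "f2")
val df2 = Seq(3, 4).toDF("cust_id")

// Joining on the column name keeps a single cust_id column;
// na.fill(0) then replaces the nulls in the feature columns
val result = df1.join(df2, Seq("cust_id"), "full_outer").na.fill(0)
result.show()
```

This avoids building newDF column by column, at the cost of filling every numeric null in the result, not just the rows that came from df2.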
