Merge two columns of different DataFrames in Spark using Scala

I want to merge two columns from separate DataFrames into one DataFrame.

I have two DataFrames, created like this:

val ds1 = sc.parallelize(Seq(1,0,1,0)).toDF("Col1")
val ds2 = sc.parallelize(Seq(234,43,341,42)).toDF("Col2")
ds1.show()

+-----+
| Col1|
+-----+
|    1|
|    0|
|    1|
|    0|
+-----+

ds2.show()
+-----+
| Col2|
+-----+
|  234|
|   43|
|  341|
|   42|
+-----+

I want a third DataFrame containing both columns, Col1 and Col2, merged row by row:

+-----+-----+
| Col1| Col2|
+-----+-----+
|    1|  234|
|    0|   43|
|    1|  341|
|    0|   42|
+-----+-----+

I tried union:

val ds3 = ds1.union(ds2)

But that appends all the rows of ds2 below the rows of ds1 in a single column, instead of putting the two columns side by side.

monotonically_increasing_id is not deterministic across DataFrames: the ids it generates depend on how the data is partitioned, so there is no guarantee that the ids assigned to the rows of two independently built DataFrames will line up, and a join on them may pair the wrong rows.
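A quick way to see why (a spark-shell illustration; the exact id values assume the current implementation, which puts the partition id in the upper bits of the generated value):

import org.apache.spark.sql.functions.monotonically_increasing_id

// With 2 partitions the ids come out as 0, 1, 8589934592, 8589934593
// (the second partition starts at 1 << 33), not 0, 1, 2, 3 -- so two
// DataFrames partitioned differently get different id sequences.
sc.parallelize(Seq(1, 2, 3, 4), 2).toDF("v")
  .withColumn("id", monotonically_increasing_id).show()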

It is easier to do this with RDDs, creating the join key with zipWithIndex:

val ds1 = sc.parallelize(Seq(1,0,1,0)).toDF("Col1")
val ds2 = sc.parallelize(Seq(234,43,341,42)).toDF("Col2")

// Convert to RDDs and zip each row with its index, which will be our join key.
// Note: do not repartition before zipWithIndex -- repartitioning shuffles each
// RDD independently, so the indices would no longer pair up the original rows.

val ds1Rdd = ds1.as[Int].rdd.zipWithIndex().map { case (v, k) => (k, v) }

val ds2Rdd = ds2.as[Int].rdd.zipWithIndex().map { case (v, k) => (k, v) }

// Check how the key-value pairs look

ds1Rdd.collect()

res50: Array[(Long, Int)] = Array((0,1), (1,0), (2,1), (3,0))

ds2Rdd.collect()

res51: Array[(Long, Int)] = Array((0,234), (1,43), (2,341), (3,42))

The first element of each tuple is our join key. We simply join the two RDDs and rearrange the values into the result DataFrame.

val joinedRdd = ds1Rdd.join(ds2Rdd)

// Sort by the index to restore the original row order, then drop the key
val resultrdd = joinedRdd.sortByKey().map { case (_, (v1, v2)) => (v1, v2) }

// resultrdd: org.apache.spark.rdd.RDD[(Int, Int)] = MapPartitionsRDD[204] at map at <console>

Finally, we convert back to a DataFrame:

resultrdd.toDF("Col1","Col2").show()
+----+----+
|Col1|Col2|
+----+----+
|   1| 234|
|   0|  43|
|   1| 341|
|   0|  42|
+----+----+
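The same idea can be wrapped into a generic helper. This is only a sketch (the name zipDataFrames is made up here); it assumes both DataFrames have the same number of rows, keys each row by its position, and joins on that key:

import org.apache.spark.sql.{DataFrame, Row}
import org.apache.spark.sql.types.StructType

def zipDataFrames(left: DataFrame, right: DataFrame): DataFrame = {
  // Key each row by its position
  val leftRdd  = left.rdd.zipWithIndex().map { case (row, idx) => (idx, row) }
  val rightRdd = right.rdd.zipWithIndex().map { case (row, idx) => (idx, row) }
  // Pair the rows up, restore the original order, and concatenate each pair
  val joined = leftRdd.join(rightRdd).sortByKey().values
    .map { case (l, r) => Row.fromSeq(l.toSeq ++ r.toSeq) }
  left.sparkSession.createDataFrame(joined, StructType(left.schema.fields ++ right.schema.fields))
}

zipDataFrames(ds1, ds2).show()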

If you are working in pandas rather than Spark, I think concat is what you want; note that axis=0 stacks the frames vertically (as in the example below), while axis=1 would place the columns side by side:

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.concat.html

pd.concat([df,df1], axis=0, ignore_index=True)
Out[12]:
   attr_1  attr_2  attr_3  id  quantity
0       0       1     NaN   1        20
1       1       1     NaN   2        23
2       1       1     NaN   3        19
3       0       0     NaN   4        19
4       1     NaN       0   5         8
5       0     NaN       1   6        13
6       1     NaN       1   7        20
7       1     NaN       1   8        25

You can create an additional id column using monotonically_increasing_id and then join the two DataFrames on that column.

scala> ds1.show
+----+
|Col1|
+----+
|   1|
|   0|
|   1|
|   0|
+----+


scala> ds2.show
+----+
|Col2|
+----+
| 234|
|  43|
| 341|
|  42|
+----+ 

scala> ds1.withColumn("id", monotonically_increasing_id).join(ds2.withColumn("id", monotonically_increasing_id), "id").drop("id").show
+----+----+
|Col1|Col2|
+----+----+
|   1| 234|
|   0|  42|
|   1| 341|
|   0|  43|
+----+----+
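As the earlier answer warns, monotonically_increasing_id is not guaranteed to produce matching ids for two independently partitioned DataFrames, because the ids are neither consecutive nor partition-independent. A more defensive sketch (the helper name withRowId is made up here) first converts the sparse ids into consecutive row numbers; note that a window with no partitionBy moves all rows to a single partition, so this only suits small data:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{monotonically_increasing_id, row_number}

def withRowId(df: DataFrame): DataFrame =
  // row_number over the generated ids yields consecutive keys 1, 2, 3, ...
  df.withColumn("id", row_number().over(Window.orderBy(monotonically_increasing_id())))

withRowId(ds1).join(withRowId(ds2), "id").drop("id").show()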

If you are doing a union, intersection, etc. of two queries or DataFrames, they must be "union compatible", meaning they have the same column definitions with compatible data types.

If both DataFrames have the same columns, the easiest solution is the newer unionByName API; if they have different schemas, it is advisable to build a compatible select list before merging.

You can create the function below to build such a compatible select list:

def merge(myCols: Set[String], allCols: Set[String]) = {
  allCols.toList.map {
    case x if myCols.contains(x) => col(x)
    case x => lit(null).as(x)
  }
}

Then use the merge method to build a compatible select query on each side, as shown below.

import org.apache.spark.sql.functions._

val ds1 = sc.parallelize(Seq(1,0,1,0)).toDF("col1")
val ds2 = sc.parallelize(Seq(234,43,341,42)).toDF("col2")
val cols1 = ds1.columns.toSet
val cols2 = ds2.columns.toSet
val unionCol = cols1 ++ cols2

// unionAll is deprecated since Spark 2.0; union behaves the same
val ds3 = ds1.select(merge(cols1, unionCol): _*).union(ds2.select(merge(cols2, unionCol): _*))
scala> ds3.show
+----+----+
|col1|col2|
+----+----+
|   1|null|
|   0|null|
|   1|null|
|   0|null|
|null| 234|
|null|  43|
|null| 341|
|null|  42|
+----+----+

You can also use unionByName, which eliminates column-ordering issues by matching columns by name:

val ds3 = ds1.select(merge(cols1, unionCol): _*).unionByName(ds2.select(merge(cols2, unionCol): _*))
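If you are on Spark 3.1 or later (an assumption; check your version), unionByName can also fill in missing columns with nulls by itself, so the merge helper is not needed for this case:

// Spark 3.1+: columns missing on either side are filled with null
val ds3 = ds1.unionByName(ds2, allowMissingColumns = true)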
