I want to merge two columns from separate DataFrames into one DataFrame.
I have two DataFrames like this:
val ds1 = sc.parallelize(Seq(1,0,1,0)).toDF("Col1")
val ds2 = sc.parallelize(Seq(234,43,341,42)).toDF("Col2")
ds1.show()
+----+
|Col1|
+----+
|   1|
|   0|
|   1|
|   0|
+----+
ds2.show()
+----+
|Col2|
+----+
| 234|
|  43|
| 341|
|  42|
+----+
I want a third DataFrame containing both columns, Col1 and Col2:
+----+----+
|Col1|Col2|
+----+----+
|   1| 234|
|   0|  43|
|   1| 341|
|   0|  42|
+----+----+
I tried union:
val ds3 = ds1.union(ds2)
But it appends all rows of ds2 below the rows of ds1.
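That is, ds3.show() prints a single Col1 column with all eight values stacked, since union resolves columns by position and keeps ds1's schema:
+----+
|Col1|
+----+
|   1|
|   0|
|   1|
|   0|
| 234|
|  43|
| 341|
|  42|
+----+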
monotonically_increasing_id is not deterministic: the ids it generates depend on how the rows are partitioned, so it is not guaranteed that the two DataFrames get matching ids.
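The generated id encodes the partition number in its upper bits, so the ids change with the partitioning; a minimal sketch of the effect (the exact ids are an illustration, not a guarantee):
import org.apache.spark.sql.functions.monotonically_increasing_id
// With more than one partition the ids are no longer 0,1,2,3:
// ids in the second partition start at 2^33 = 8589934592, and which
// rows land in which partition depends on the (re)partitioning.
ds1.repartition(2).withColumn("id", monotonically_increasing_id).show()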
It is easier to do this with RDDs, creating the join key with zipWithIndex:
val ds1 = sc.parallelize(Seq(1,0,1,0)).toDF("Col1")
val ds2 = sc.parallelize(Seq(234,43,341,42)).toDF("Col2")
// Convert to RDDs and zipWithIndex; the index will be our join key
val ds1Rdd = ds1.as[Int].rdd.repartition(4).zipWithIndex().map { case (v, k) => (k, v) }
val ds2Rdd = ds2.as[Int].rdd.repartition(4).zipWithIndex().map { case (v, k) => (k, v) }
// Check how the key-value pairs look
ds1Rdd.collect()
res50: Array[(Long, Int)] = Array((0,0), (1,1), (2,1), (3,0))
ds2Rdd.collect()
res51: Array[(Long, Int)] = Array((0,341), (1,42), (2,43), (3,234))
The first element of each tuple is our join key, so we simply join and reshape into the result:
val joinedRdd = ds1Rdd.join(ds2Rdd)
val resultrdd = joinedRdd.map { case (_, (v1, v2)) => (v1, v2) }
// resultrdd: org.apache.spark.rdd.RDD[(Int, Int)] = MapPartitionsRDD[204] at map at <console>
And we convert it to a DataFrame:
resultrdd.toDF("Col1","Col2").show()
+----+----+
|Col1|Col2|
+----+----+
| 0| 341|
| 1| 42|
| 1| 43|
| 0| 234|
+----+----+
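Note that the repartition(4) shuffles the rows, which is why the result above is not in the original order. A sketch of the same idea without the repartition, assuming both DataFrames were built the same way with the same number of rows, so zipWithIndex stays aligned with the input order:
val ds1Rdd = ds1.as[Int].rdd.zipWithIndex().map { case (v, k) => (k, v) }
val ds2Rdd = ds2.as[Int].rdd.zipWithIndex().map { case (v, k) => (k, v) }
// Sort by the index key so the output follows the original row order;
// this should give (1,234), (0,43), (1,341), (0,42)
ds1Rdd.join(ds2Rdd).sortByKey().values.toDF("Col1", "Col2").show()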
I think in this case concat is what you want:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.concat.html
pd.concat([df,df1], axis=0, ignore_index=True)
Out[12]:
attr_1 attr_2 attr_3 id quantity
0 0 1 NaN 1 20
1 1 1 NaN 2 23
2 1 1 NaN 3 19
3 0 0 NaN 4 19
4 1 NaN 0 5 8
5 0 NaN 1 6 13
6 1 NaN 1 7 20
7 1 NaN 1 8 25
You can create an additional id column using monotonically_increasing_id and then join the two DataFrames on that column.
scala> ds1.show
+----+
|Col1|
+----+
| 1|
| 0|
| 1|
| 0|
+----+
scala> ds2.show
+----+
|Col2|
+----+
| 234|
| 43|
| 341|
| 42|
+----+
scala> import org.apache.spark.sql.functions.monotonically_increasing_id
import org.apache.spark.sql.functions.monotonically_increasing_id

scala> ds1.withColumn("id", monotonically_increasing_id).join(ds2.withColumn("id", monotonically_increasing_id), "id").drop("id").show
+----+----+
|Col1|Col2|
+----+----+
| 1| 234|
| 0| 42|
| 1| 341|
| 0| 43|
+----+----+
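Since monotonically_increasing_id only guarantees unique, increasing ids, the two sides are not guaranteed to receive matching ids when they are partitioned differently. A more robust variant, sketched below, turns the ids into consecutive row numbers before joining; the global window pulls all rows into a single partition, so it only suits small data:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{monotonically_increasing_id, row_number}

// Rank rows by the generated id, yielding consecutive numbers 1,2,3,...
// on both sides; Spark will warn about the unpartitioned window.
val w = Window.orderBy(monotonically_increasing_id())
val ds1i = ds1.withColumn("id", row_number().over(w))
val ds2i = ds2.withColumn("id", row_number().over(w))
ds1i.join(ds2i, "id").drop("id").show()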
If you are doing union, intersection, etc. on two queries or DataFrames, they must be "union compatible", meaning they have the same column definitions with compatible data types.
If both DataFrames have the same columns, the easiest solution is the newer unionByName API, whereas if they have different schemas it is advisable to build a compatible select before merging.
You can use the function below to build such a compatible select query:
import org.apache.spark.sql.functions.{col, lit}

def merge(myCols: Set[String], allCols: Set[String]) = {
  allCols.toList.map {
    case x if myCols.contains(x) => col(x)
    case x => lit(null).as(x)
  }
}
Then use the merge method to build a compatible select on each side, as shown below.
import org.apache.spark.sql.functions._
val ds1 = sc.parallelize(Seq(1,0,1,0)).toDF("col1")
val ds2 = sc.parallelize(Seq(234,43,341,42)).toDF("col2")
val cols1 = ds1.columns.toSet
val cols2 = ds2.columns.toSet
val unionCol = cols1 ++ cols2
val ds3 = ds1.select(merge(cols1, unionCol): _*).union(ds2.select(merge(cols2, unionCol): _*))
scala> ds3.show
+----+----+
|col1|col2|
+----+----+
| 1|null|
| 0|null|
| 1|null|
| 0|null|
|null| 234|
|null| 43|
|null| 341|
|null| 42|
+----+----+
You can also use unionByName, which eliminates any dependence on column ordering:
val ds3 = ds1.select(merge(cols1, unionCol): _*).unionByName(ds2.select(merge(cols2, unionCol): _*))
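For what it's worth, when two DataFrames already share the same column names, unionByName can be called directly and matches columns by name rather than by position; a minimal sketch with made-up data:
val a = sc.parallelize(Seq((1, "x"))).toDF("col1", "col2")
val b = sc.parallelize(Seq(("y", 2))).toDF("col2", "col1") // same columns, swapped order
a.unionByName(b).show() // rows line up by column name, not position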