
Join two dataframes with Scala Spark

I have two dataframes:

  • The first dataframe DFNum has 48 columns and 58500 rows.
  • The second dataframe DFString has 7 columns and 58500 rows.

The columns of the two dataframes are all different from each other. My goal is simply to combine them into a single dataframe with 55 columns (48 + 7) and still 58500 rows, keeping the row order they had before the join.

I made several attempts, also reading the other questions, but without success. In particular I tried:

val df = DFNum.join(DFString)

This gives me the following error: Detected implicit cartesian product for INNER join between logical plans. Join condition is missing or trivial. Either: use the CROSS JOIN syntax to allow cartesian products between these relations, or: enable implicit cartesian products by setting the configuration variable spark.sql.crossJoin.enabled=true;

Obviously, with the cross join I get many more rows than I want: 58500 * 58500 rows.

Then I tried adding an identical id column to both dataframes to join on:

val tmpNum = DFNum.withColumn("id", monotonically_increasing_id())
val tmpString = DFString.withColumn("id", monotonically_increasing_id())

and use:

val df = tmpNum.join(tmpString)

This gives me the following error: USING column `id` cannot be resolved on the left side of the join. The left-side columns:[...]

I also tried several types of joins (with both tmpNum/tmpString and DFNum/DFString), such as:

val df = tmpNum.join(tmpString, Seq("id"), "outer")
val df = tmpNum.join(tmpString, Seq("id"), "full_outer")

etc., but I always get the same error: USING column `id` cannot be resolved on the left side of the join. The left-side columns:[...]

(Obviously, with tmpNum and tmpString the new dataframe will have one extra column. I will drop the id column afterwards.)

If anyone has any ideas or suggestions, I would appreciate it.

If you don't have any key columns to join the 2 dataframes on, you can rely on monotonically_increasing_id:

val a = Seq(("First",1), ("Secound",2), ("Third",3), ("Fourth",4)).toDF("col1", "col2")
val b = Seq(("india",980), ("japan",990), ("korea",1000), ("chaina",900)).toDF("col3", "col4")

a.show

+-------+----+
|   col1|col2|
+-------+----+
|  First|   1|
|Secound|   2|
|  Third|   3|
| Fourth|   4|
+-------+----+

b.show
+------+----+
|  col3|col4|
+------+----+
| india| 980|
| japan| 990|
| korea|1000|
|chaina| 900|
+------+----+

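Then add a new column to both dataframes. Make sure that both dataframes are sorted properly, otherwise the rows will be mismatched after the join. Note that the generated ids are unique and increasing but not consecutive, and they depend on partitioning, so they only line up across the two dataframes when both have the same number of rows in each partition.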

import org.apache.spark.sql.functions.monotonically_increasing_id

val a1 = a.withColumn("id", monotonically_increasing_id())
val b1 = b.withColumn("id", monotonically_increasing_id())

Now join both dataframes on the id column, sort by id to keep the original order, then drop the intermediate id column:

a1.join(b1, Seq("id")).orderBy("id").drop("id").show

+-------+----+------+----+
|   col1|col2|  col3|col4|
+-------+----+------+----+
|  First|   1| india| 980|
|Secound|   2| japan| 990|
|  Third|   3| korea|1000|
| Fourth|   4|chaina| 900|
+-------+----+------+----+
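Since the ids from monotonically_increasing_id depend on partitioning, a variant that may be safer on larger data is to number the rows with RDD.zipWithIndex, which yields consecutive 0-based indices. The following is only a minimal sketch: the helper name withRowIndex, the object name and the local SparkSession settings are my own assumptions, it reuses the sample rows from the answer above, and it still assumes the current row order of each dataframe is the order you want to preserve.

```scala
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.apache.spark.sql.types.{LongType, StructField, StructType}

object PositionalJoin {
  // Hypothetical helper (not part of Spark): appends a consecutive,
  // 0-based row index as an extra LongType column.
  def withRowIndex(df: DataFrame, name: String = "id"): DataFrame = {
    val indexed = df.rdd.zipWithIndex.map { case (row, idx) =>
      Row.fromSeq(row.toSeq :+ idx)
    }
    val schema = StructType(df.schema.fields :+ StructField(name, LongType, nullable = false))
    df.sparkSession.createDataFrame(indexed, schema)
  }

  def main(args: Array[String]): Unit = {
    // Assumed local session, for illustration only.
    val spark = SparkSession.builder().master("local[2]").appName("positional-join").getOrCreate()
    import spark.implicits._

    val a = Seq(("First", 1), ("Secound", 2), ("Third", 3), ("Fourth", 4)).toDF("col1", "col2")
    val b = Seq(("india", 980), ("japan", 990), ("korea", 1000), ("chaina", 900)).toDF("col3", "col4")

    // Join on the positional index, restore the original order, drop the helper column.
    val joined = withRowIndex(a)
      .join(withRowIndex(b), Seq("id"))
      .orderBy("id")
      .drop("id")

    joined.show()
    spark.stop()
  }
}
```

The trade-off is a round trip through the RDD API, which is usually slower than a pure column expression, but the indices no longer depend on how the two dataframes happen to be partitioned.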

Two datasets can't be joined without matching data, except with a cartesian product. The column names need not be the same, but the values in the columns need to match. You can join 2 dataframes on all of their columns if you need to.

val df1 = // a dataframe with columns col1-1, col1-2, col1-3
val df2 = // a dataframe with columns col2-1, col2-2, col2-3

// Column names containing "-" are not valid Scala identifiers, so look them up by name
val dfJoined = df1.join(df2, df1("col1-1") === df2("col2-1") or df1("col1-2") === df2("col2-2") or df1("col1-3") === df2("col2-3"))

// Then drop one set of columns if they hold the same data.
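To make the point about matching values concrete, here is a small worked sketch. The data, the column names person_id and owner_id, and the helper joinOnKeys are all made up for illustration. The two key columns have different names but share values, and an inner join keeps only the rows whose keys match.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object KeyedJoin {
  // Hypothetical helper: joins on two differently named key columns,
  // then drops the redundant right-hand key.
  def joinOnKeys(left: DataFrame, right: DataFrame): DataFrame =
    left.join(right, left("person_id") === right("owner_id")).drop("owner_id")

  def main(args: Array[String]): Unit = {
    // Assumed local session, for illustration only.
    val spark = SparkSession.builder().master("local[2]").appName("keyed-join").getOrCreate()
    import spark.implicits._

    // Made-up sample data: keys 1 and 2 appear on both sides, 3 and 4 do not.
    val people = Seq((1, "Ann"), (2, "Bob"), (3, "Cid")).toDF("person_id", "name")
    val cities = Seq((1, "Rome"), (2, "Oslo"), (4, "Lima")).toDF("owner_id", "city")

    // Inner join: only the rows with keys 1 and 2 survive.
    joinOnKeys(people, cities).show()
    spark.stop()
  }
}
```

In the question's case there is no such shared column, which is why an artificial positional id has to be added first.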

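I tried to do this recently, with no success at all. If you are using PySpark, you can try converting the two objects to pandas dataframes and then concatenating them. Note that toPandas() collects all rows to the driver, so this only works when the data fits in memory.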

Step #1:

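import pandas as pd

# Collect both Spark dataframes to the driver as pandas dataframes
df1 = df1.select("*").toPandas()
df2 = df2.select("*").toPandas()

Step #2:

# Concatenate column-wise; rows are aligned on the default 0..n-1 index, i.e. by position
result = pd.concat([df1, df2], axis=1)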

Good luck!!
