I have two dataframes: DFNum has 48 columns and 58500 rows, and DFString has 7 columns and 58500 rows. No column name appears in both dataframes. My goal is simply to join the two dataframes into one that has 55 columns (48 + 7) and still 58500 rows, keeping the row order they had before the join.
I made several attempts, also reading the other questions, but without success. In particular I tried:
val df = DFNum.join(DFString)
and this gives me the following error:
Detected implicit cartesian product for INNER join between logical plans. Join condition is missing or trivial. Either: use the CROSS JOIN syntax to allow cartesian products between these relations, or: enable implicit cartesian products by setting the configuration variable spark.sql.crossJoin.enabled=true;
Obviously, with a cross join I get far more rows than I want: 58500 * 58500.
Then I tried adding an identical id column to both dataframes to join on:
val tmpNum = DFNum.withColumn("id", monotonically_increasing_id())
val tmpString = DFString.withColumn("id", monotonically_increasing_id())
and use:
val df = tmpNum.join(tmpString)
and this gives me the following error:
USING column `id` cannot be resolved on the left side of the join. The left-side columns:[...]
I also tried several types of joins (with both tmpNum and tmpString, and DFNum and DFString), such as:
val df = tmpNum.join(tmpString, Seq("id"), "outer")
val df = tmpNum.join(tmpString, Seq("id"), "full_outer")
etc., but I always get the same error:
USING column `id` cannot be resolved on the left side of the join. The left-side columns:[...]
(Obviously, with tmpNum and tmpString the new dataframe will have one extra column. I will drop the id column later.)
If anyone has any ideas or suggestions, I would appreciate it.
If you don't have any key columns to join the two dataframes on, you can generate one with monotonically_increasing_id:
val a = Seq(("First",1), ("Secound",2), ("Third",3), ("Fourth",4)).toDF("col1", "col2")
val b = Seq(("india",980), ("japan",990), ("korea",1000), ("chaina",900)).toDF("col3", "col4")
a.show
+-------+----+
|   col1|col2|
+-------+----+
|  First|   1|
|Secound|   2|
|  Third|   3|
| Fourth|   4|
+-------+----+
b.show
+------+----+
|  col3|col4|
+------+----+
| india| 980|
| japan| 990|
| korea|1000|
|chaina| 900|
+------+----+
Then add a new id column to both dataframes. Make sure both dataframes are already sorted the way you want; otherwise the rows will be mismatched after the join.
import org.apache.spark.sql.functions.monotonically_increasing_id
val a1 = a.withColumn("id", monotonically_increasing_id())
val b1 = b.withColumn("id", monotonically_increasing_id())
Now join both dataframes on the id column, then drop the intermediate id column:
a1.join(b1, Seq("id")).drop("id").show
+-------+----+------+----+
|   col1|col2|  col3|col4|
+-------+----+------+----+
|  First|   1| india| 980|
|Secound|   2| japan| 990|
|  Third|   3| korea|1000|
| Fourth|   4|chaina| 900|
+-------+----+------+----+
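One caveat worth hedging: monotonically_increasing_id produces unique, increasing ids, but they are not consecutive and depend on how each dataframe is partitioned, so the ids of two dataframes only line up when both have the same partition layout. A possible alternative, sketched below, is to number the rows with RDD.zipWithIndex, which yields consecutive 0-based indices. The helper name withRowIndex and the object name ZipByRowIndex are made up for this illustration, and the sketch assumes the current row order of each dataframe is the order you want to pair on.

```scala
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.apache.spark.sql.types.{LongType, StructField, StructType}

object ZipByRowIndex {
  // Hypothetical helper: appends a consecutive 0-based row index column.
  // Assumes the dataframe's current row order is the order to pair on.
  def withRowIndex(df: DataFrame, name: String = "id"): DataFrame = {
    // zipWithIndex pairs every row with its position across all partitions.
    val indexed = df.rdd.zipWithIndex.map { case (row, idx) =>
      Row.fromSeq(row.toSeq :+ idx)
    }
    // Extend the original schema with the new index column.
    val schema = StructType(df.schema.fields :+ StructField(name, LongType, nullable = false))
    df.sparkSession.createDataFrame(indexed, schema)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("zip-by-row-index")
      .getOrCreate()
    import spark.implicits._

    // Same sample data as in the answer above.
    val a = Seq(("First", 1), ("Secound", 2), ("Third", 3), ("Fourth", 4)).toDF("col1", "col2")
    val b = Seq(("india", 980), ("japan", 990), ("korea", 1000), ("chaina", 900)).toDF("col3", "col4")

    // Join on the generated index, restore the order, then drop the index.
    val joined = withRowIndex(a)
      .join(withRowIndex(b), Seq("id"))
      .orderBy("id")
      .drop("id")

    joined.show()
    spark.stop()
  }
}
```

Going through the RDD API costs an extra pass over the data, so this is a trade of some speed for predictable ids.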
Two datasets can't be joined without matching data, except with a cartesian (cross) join. The column names need not be the same, but the values in the join columns must match. You can join the two dataframes on all of their columns if you need to.
// df1: a dataframe with columns col1-1, col1-2, col1-3
// df2: a dataframe with columns col2-1, col2-2, col2-3
val dfJoined = df1.join(df2, df1("col1-1") === df2("col2-1") || df1("col1-2") === df2("col2-2") || df1("col1-3") === df2("col2-3"))
// Then drop one set of columns if they hold the same data.
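As a concrete illustration of joining on columns whose names differ but whose values match, here is a minimal sketch. The column names left_key and right_key, the sample rows, and the helper joinOnKeys are all invented for this example and are not taken from the question.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object JoinOnMatchingValues {
  // Hypothetical helper: inner-joins rows whose key values are equal,
  // even though the key columns have different names.
  def joinOnKeys(left: DataFrame, right: DataFrame): DataFrame =
    left.join(right, left("left_key") === right("right_key"), "inner")

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("join-on-matching-values")
      .getOrCreate()
    import spark.implicits._

    // Made-up sample data: the keys share values but not names.
    val left = Seq((1, "First"), (2, "Secound")).toDF("left_key", "col1")
    val right = Seq((1, "india"), (2, "japan")).toDF("right_key", "col3")

    // After the join both key columns hold the same data, so drop one.
    joinOnKeys(left, right).drop("right_key").show()
    spark.stop()
  }
}
```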
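I tried to do this recently in Spark, with no success at all. As a workaround, you can convert the two dataframes to pandas dataframes and concatenate them column-wise. Note that toPandas() collects every row to the driver, so this only suits data that fits in driver memory.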
Step #1:
import pandas as pd
df1 = df1.select("*").toPandas()
df2 = df2.select("*").toPandas()
Step #2:
result = pd.concat([df1, df2], axis=1)
Good luck!!