
How to select all columns of a dataframe in a join - Spark-Scala

I am joining two data frames and want to select all columns of the left frame, for example:

val join_df = first_df.join(second_df, first_df("id") === second_df("id") , "left_outer")

In the above I want to do select first_df.*. How can I select all columns of one frame in a join?

Use an alias:

first_df.alias("fst").join(second_df, Seq("id"), "left_outer").select("fst.*")
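
To make the behaviour concrete, here is a minimal sketch you could run in a spark-shell session (first_df, second_df, and their name/score columns are hypothetical sample data, not from the question):

// Hypothetical sample frames for illustration only
val first_df = Seq((1, "a"), (2, "b")).toDF("id", "name")
val second_df = Seq((1, 10), (3, 30)).toDF("id", "score")

// Alias the left frame so "fst.*" resolves to its columns alone
val join_df = first_df.alias("fst")
  .join(second_df, Seq("id"), "left_outer")
  .select("fst.*")

join_df.show()
// +---+----+
// | id|name|
// +---+----+
// |  1|   a|
// |  2|   b|
// +---+----+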

Suppose you:

  1. Want to use the DataFrame syntax.
  2. Want to select all columns from df1 but only a couple from df2.
  3. Find it cumbersome to list them out explicitly due to the number of columns in df1.

Then, you might do the following:

val selectColumns = df1.columns.map(df1(_)) ++ Array(df2("field1"), df2("field2"))
df1.join(df2, df1("key") === df2("key")).select(selectColumns:_*)
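
Here df1.columns returns the column names as an Array[String]; mapping each one through df1(_) turns it into a Column qualified to df1, and the :_* suffix expands the resulting array into the varargs that select expects. The names key, field1, and field2 are placeholders for your actual join key and the handful of df2 columns you want to keep.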

We can also do it with a leftsemi join. A leftsemi join returns only the columns of the left dataframe, keeping just the rows that have a match on the right; note that, unlike left_outer, unmatched left rows are dropped.

Here we join two dataframes, df1 and df2, on column col1.

    df1.join(df2, df1.col("col1").equalTo(df2.col("col1")), "leftsemi") 
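
A minimal sketch of that behaviour in a spark-shell session (the v1/v2 columns and the data are made up for illustration):

// Hypothetical data: only col1 = 1 exists on both sides
val df1 = Seq((1, "a"), (2, "b")).toDF("col1", "v1")
val df2 = Seq((1, "x")).toDF("col1", "v2")

// leftsemi keeps df1's columns only, for rows matched in df2
df1.join(df2, df1.col("col1").equalTo(df2.col("col1")), "leftsemi").show()
// +----+---+
// |col1| v1|
// +----+---+
// |   1|  a|
// +----+---+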

Just to add one possibility: without using an alias, I was able to do this in PySpark with

   first_df.join(second_df, "id", "left_outer").select(first_df["*"])

Not sure if it applies here, but I hope it helps.
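
The same idea seems to carry over to Scala, since a Dataset's col/apply also accept "*"; a hedged one-line sketch, reusing first_df and second_df from the question:

// Select all of first_df's columns without an alias
first_df.join(second_df, Seq("id"), "left_outer").select(first_df("*"))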
