Concatenate columns in Apache Spark DataFrame

Question

How do we concatenate two columns in an Apache Spark DataFrame? Is there any function in Spark SQL which we can use?

Answer 1

With raw SQL you can use CONCAT:

In Python

df = sqlContext.createDataFrame([("foo", 1), ("bar", 2)], ("k", "v"))
df.registerTempTable("df")
sqlContext.sql("SELECT CONCAT(k, ' ', v) FROM df")

In Scala

import sqlContext.implicits._

val df = sc.parallelize(Seq(("foo", 1), ("bar", 2))).toDF("k", "v")
df.registerTempTable("df")
sqlContext.sql("SELECT CONCAT(k, ' ', v) FROM df")

Since Spark 1.5.0 you can use the concat function with the DataFrame API:

In Python:

from pyspark.sql.functions import concat, col, lit
df.select(concat(col("k"), lit(" "), col("v")))

In Scala:

import org.apache.spark.sql.functions.{concat, lit}
df.select(concat($"k", lit(" "), $"v"))

There is also the concat_ws function, which takes a string separator as its first argument.

Answer 2

Here's how you can give the concatenated column a custom name:

import pyspark
from pyspark.sql import functions as sf
sc = pyspark.SparkContext()
sqlc = pyspark.SQLContext(sc)
df = sqlc.createDataFrame([('row11','row12'), ('row21','row22')], ['colname1', 'colname2'])
df.show()

gives,

+--------+--------+
|colname1|colname2|
+--------+--------+
|   row11|   row12|
|   row21|   row22|
+--------+--------+

Create a new column by concatenating:

df = df.withColumn('joined_column', 
                    sf.concat(sf.col('colname1'),sf.lit('_'), sf.col('colname2')))
df.show()

+--------+--------+-------------+
|colname1|colname2|joined_column|
+--------+--------+-------------+
|   row11|   row12|  row11_row12|
|   row21|   row22|  row21_row22|
+--------+--------+-------------+

Answer 3

One option to concatenate string columns in Spark Scala is using concat.

It is necessary to check for null values, because if one of the columns is null, the result will be null even if the other columns contain data.

Using concat and withColumn:

import org.apache.spark.sql.functions.{col, concat, lit, when}

val newDf =
  df.withColumn(
    "NEW_COLUMN",
    concat(
      when(col("COL1").isNotNull, col("COL1")).otherwise(lit("null")),
      when(col("COL2").isNotNull, col("COL2")).otherwise(lit("null"))))

Using concat and selectExpr:

val newDf = df.selectExpr("concat(nvl(COL1, ''), nvl(COL2, '')) as NEW_COLUMN")

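With both approaches you get a NEW_COLUMN whose value is the concatenation of COL1 and COL2 from your original df. Note that the first approach substitutes the literal string "null" for a missing value, while the second substitutes an empty string.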
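To make the null behaviour concrete, here is a minimal plain-Python model of it. This is not Spark code: the helper names model_concat, model_nvl and model_concat_ws are made up for this illustration, and None stands in for SQL NULL. It assumes that concat returns null as soon as any input is null, while concat_ws skips null inputs.

```python
# Plain-Python model of Spark's null handling; None stands in for SQL NULL.
# The function names are hypothetical and exist only for this illustration.

def model_concat(*values):
    """Mimic concat: the result is NULL as soon as any input is NULL."""
    if any(v is None for v in values):
        return None
    return "".join(str(v) for v in values)


def model_nvl(value, default):
    """Mimic nvl: fall back to the default when the value is NULL."""
    return default if value is None else value


def model_concat_ws(sep, *values):
    """Mimic concat_ws: NULL inputs are skipped instead of propagated."""
    if sep is None:
        return None
    return sep.join(str(v) for v in values if v is not None)


rows = [("a", "b"), ("c", None)]
for col1, col2 in rows:
    print(
        model_concat(col1, col2),                                # plain concat
        model_concat(model_nvl(col1, ""), model_nvl(col2, "")),  # concat + nvl
        model_concat_ws("", col1, col2),                         # concat_ws
    )
```

For the second row the plain concat model yields None, while the nvl and concat_ws variants both yield c. That is why concat_ws can be a shorter alternative when skipping nulls is what you want.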

Answer 4

concat(*cols)

v1.5 and higher

Concatenates multiple input columns together into a single column. The function works with strings, binary and compatible array columns.

Example: new_df = df.select(concat(df.a, df.b, df.c))

concat_ws(sep, *cols)

v1.5 and higher

Similar to concat but uses the specified separator.

Example: new_df = df.select(concat_ws('-', df.col1, df.col2))

map_concat(*cols)

v2.4 and higher

Concatenates maps, returning the union of all the given maps.

Example: new_df = df.select(map_concat("map1", "map2"))

Using the concat operator (||):

v2.3 and higher

Example: df = spark.sql("select col_a || col_b || col_c as abc from table_x")

Reference: Spark SQL documentation

Answer 5

If you want to do it with the DataFrame API, you could use a udf to add a new column based on existing columns.

import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions.udf

val sqlContext = new SQLContext(sc)
import sqlContext.implicits._ // enables the $"col" syntax
case class MyDf(col1: String, col2: String)

// here is our DataFrame
val df = sqlContext.createDataFrame(sc.parallelize(
    Array(MyDf("A", "B"), MyDf("C", "D"), MyDf("E", "F"))
))

// define a udf to concatenate two passed-in string values
val getConcatenated = udf( (first: String, second: String) => { first + " " + second } )

// use the withColumn method to add a new column called newColName
df.withColumn("newColName", getConcatenated($"col1", $"col2")).select("newColName", "col1", "col2").show()

Answer 6

From Spark 2.3 (SPARK-22771), Spark SQL supports the concatenation operator ||.

For example:

val df = spark.sql("select _c1 || _c2 as concat_column from <table_name>")

Answer 7

Here is another way of doing this in PySpark:

#import concat and lit functions from pyspark.sql.functions 
from pyspark.sql.functions import concat, lit

#Create your data frame
countryDF = sqlContext.createDataFrame([('Ethiopia',), ('Kenya',), ('Uganda',), ('Rwanda',)], ['East Africa'])

#Use select, concat, and lit functions to do the concatenation
personDF = countryDF.select(concat(countryDF['East Africa'], lit('n')).alias('East African'))

#Show the new data frame
personDF.show()

----------RESULT-------------------------

+------------+
|East African|
+------------+
|   Ethiopian|
|      Kenyan|
|     Ugandan|
|     Rwandan|
+------------+

Answer 8

Here is a suggestion for when you don't know the number or names of the columns in the DataFrame.

val dfResults = dfSource.select(concat_ws(",",dfSource.columns.map(c => col(c)): _*))
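For PySpark users, a similar effect can be sketched by building a concat_ws expression string from a list of column names and handing it to selectExpr. The helper below is plain Python; its name concat_ws_expr, the default separator and the alias "joined" are assumptions made for this illustration. To stay simple it refuses separators and names that would need escaping.

```python
def concat_ws_expr(columns, sep=",", alias="joined"):
    """Build a Spark SQL expression string that joins all named columns with sep.

    Hypothetical helper for illustration; it only builds a string and does
    not talk to Spark itself.
    """
    columns = list(columns)
    if not columns:
        raise ValueError("at least one column name is required")
    # Keep the sketch simple: refuse characters that would need escaping.
    if "'" in sep or "\\" in sep:
        raise ValueError("separator must not contain quotes or backslashes")
    if any("`" in name for name in columns + [alias]):
        raise ValueError("names must not contain backticks")
    # Backtick-quote each name so that names with spaces are still valid.
    quoted = ", ".join(f"`{name}`" for name in columns)
    return f"concat_ws('{sep}', {quoted}) AS `{alias}`"


print(concat_ws_expr(["id", "first name", "city"], sep="|"))
```

With a DataFrame df at hand, the result could then be used as df.selectExpr(concat_ws_expr(df.columns)).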

Answer 9

Do we have Java syntax corresponding to the following?

val dfResults = dfSource.select(concat_ws(",",dfSource.columns.map(c => col(c)): _*))

Answer 10

In Spark 2.3.0, you can do:

spark.sql( """ select '1' || column_a from table_a """)

Answer 11

In Java you can concatenate multiple columns as follows. The sample code sketches a scenario to show how it is used.

import static org.apache.spark.sql.functions.*; // col, concat, lit

SparkSession spark = JavaSparkSessionSingleton.getInstance(rdd.context().getConf());
Dataset<Row> reducedInventory = spark.sql("select * from table_name")
                        .withColumn("concatenatedCol",
                                concat(col("col1"), lit("_"), col("col2"), lit("_"), col("col3")));


class JavaSparkSessionSingleton {
    private static transient SparkSession instance = null;

    public static SparkSession getInstance(SparkConf sparkConf) {
        if (instance == null) {
            instance = SparkSession.builder().config(sparkConf)
                    .getOrCreate();
        }
        return instance;
    }
}

The above code concatenates col1, col2 and col3, separated by "_", to create a column named "concatenatedCol".

Answer 12

In my case, I wanted a pipe ('|') delimited row.

from pyspark.sql import functions as F
df.select(F.concat_ws('|','_c1','_c2','_c3','_c4')).show()

This worked like a hot knife through butter.

Answer 13

Use the concat method like this:

// requires: import static org.apache.spark.sql.functions.*;
Dataset<Row> DF2 = DF1
            .withColumn("NEW_COLUMN", concat(col("ADDR1"), col("ADDR2"), col("ADDR3")));

Answer 14

Another way to do it in PySpark using sqlContext:

from pyspark.sql.functions import concat

# Suppose we have a DataFrame:
df = sqlContext.createDataFrame([('row1_1','row1_2')], ['colname1', 'colname2'])

# Now we can concatenate columns and assign the new column a name
df = df.select(concat(df.colname1, df.colname2).alias('joined_colname'))

Answer 15

Indeed, there are built-in abstractions that let you concatenate without implementing a custom function. Since you mentioned Spark SQL, I am guessing you want to pass it as a declarative command through spark.sql(). If so, you can do it in a straightforward manner with a SQL command like:

SELECT CONCAT(col1, '<delimiter>', col2, ...) AS concat_column_name FROM <table_name>;

Also, from Spark 2.3.0, you can use commands along the lines of:

SELECT col1 || col2 AS concat_column_name FROM <table_name>;

Here, <delimiter> is your preferred delimiter (it can be an empty space as well) and <table_name> is the temporary or permanent table you are trying to read from.

Answer 16

We can also simply use selectExpr.

df1.selectExpr("*","upper(_2||_3) as new")

Answer 17

We can use concat() in the select method of a DataFrame:

val fullName = nameDF.select(concat(col("FirstName"), lit(" "), col("LastName")).as("FullName"))

Using withColumn and concat:

val fullName1 = nameDF.withColumn("FullName", concat(col("FirstName"), lit(" "), col("LastName")))

Using the concat function in spark.sql:

val fullNameSql = spark.sql("select Concat(FirstName, LastName) as FullName from names")

Taken from https://www.sparkcodehub.com/spark-dataframe-concat-column

Answer 18

val newDf =
  df.withColumn(
    "NEW_COLUMN",
    concat(
      when(col("COL1").isNotNull, col("COL1")).otherwise(lit("null")),
      when(col("COL2").isNotNull, col("COL2")).otherwise(lit("null"))))

Note: if the code above does not compile for you, put the parentheses "()" after "isNotNull", i.e. write "isNotNull()" (Java, for instance, always requires them):

val newDf =
  df.withColumn(
    "NEW_COLUMN",
    concat(
      when(col("COL1").isNotNull(), col("COL1")).otherwise(lit("null")),
      when(col("COL2").isNotNull(), col("COL2")).otherwise(lit("null"))))

18 answers

solution1 220 2015-07-16 10:50:22
solution2 60 2017-04-26 21:50:51
solution3 41 2018-03-29 07:03:14
solution4 24 2020-05-20 18:32:57
solution5 18 2015-07-20 22:14:18
solution6 13 2018-04-19 14:09:43
solution7 11 2016-07-16 17:29:19
solution8 9 2017-08-17 17:46:45
solution9 3 2020-03-10 04:13:35
solution10 2 2018-03-12 20:24:29
solution11 1 2018-04-19 18:19:52
solution12 1 2020-12-05 17:54:30
solution13 1 2021-11-09 06:17:14
solution14 0 2017-01-10 17:43:55
solution15 0
solution16 0 2020-06-07 15:19:37
solution17 0 2022-06-12 15:09:50
solution18 -2 2020-11-16 20:53:57
