Convert multiple columns into a single column in dataframe

I have a scenario where I have to combine the data from several columns so that it is displayed in a single column.

Below is the data available.

+-----------------------+----------+-----------------------+------+
|BaseTime               |SGNL_NAME |SGNL_TIME              |SGNL_V|
+-----------------------+----------+-----------------------+------+
|2019-11-21 18:19:15.817|Acc       |2019-11-21 18:18:16.645|0.0   |
|2019-11-21 18:19:15.817|Acc       |2019-11-21 18:18:16.645|0.0   |
|2019-11-21 18:19:15.817|Acc       |2019-11-21 18:18:16.645|0.0   |
|2019-11-21 18:19:15.817|Acc       |2019-11-21 18:18:17.645|0.0   |
|2019-11-21 18:19:15.817|Acc       |2019-11-21 18:18:17.645|0.0   |
+-----------------------+----------+-----------------------+------+

The expected output is shown below: a new column SGNL is created that holds an array whose element combines SGNL_NAME, SGNL_TIME and SGNL_V.

"SGNL": [
        {
            "SGNL_NAME": "Acc",
            "SGNL_TIME": 1574128316834,
            "SGNL_V": 0.0
        }
       ]


+-----------------------+-----------------------------------------------------------------+
|BaseTime               |SGNL                                                             |
+-----------------------+-----------------------------------------------------------------+
|2019-11-21 18:19:15.817|[{"SGNL_NAME": "Acc" ,"SGNL_TIME": 1574128316834,"SGNL_V": 0.0}]|
|2019-11-21 18:19:15.817|[{"SGNL_NAME": "Acc" ,"SGNL_TIME": 1574128316834,"SGNL_V": 0.0}]|
|2019-11-21 18:19:15.817|[{"SGNL_NAME": "Acc" ,"SGNL_TIME": 1574128316834,"SGNL_V": 0.0}]|
|2019-11-21 18:19:15.817|[{"SGNL_NAME": "Acc" ,"SGNL_TIME": 1574128316834,"SGNL_V": 0.0}]|
|2019-11-21 18:19:15.817|[{"SGNL_NAME": "Acc" ,"SGNL_TIME": 1574128316834,"SGNL_V": 0.0}]|
+-----------------------+-----------------------------------------------------------------+

The schema of the input is given below:

root
 |-- BaseTime: timestamp (nullable = true)
 |-- SGNL_NAME: string (nullable = true)
 |-- SGNL_TIME: timestamp (nullable = true)
 |-- SGNL_V: string (nullable = true)

I am trying to write a UDF to combine the rows. Are there any other solutions available?

You can use to_json to convert multiple columns to JSON, as shown below:

scala> val df = sc.parallelize(Seq(
     |   (32.0, 31.0, 14.0), (3.6, 2.8, 0.0), (4.5, 5.0, -1.2)
     | )).toDF


scala> df.show(10)
+----+----+----+
|  _1|  _2|  _3|
+----+----+----+
|32.0|31.0|14.0|
| 3.6| 2.8| 0.0|
| 4.5| 5.0|-1.2|
+----+----+----+

scala> df.select(to_json(struct($"_1", $"_2", $"_3"))).show(10)
+------------------------------------------------------------------------------------------------+
|structstojson(named_struct(NamePlaceholder(), _1, NamePlaceholder(), _2, NamePlaceholder(), _3))|
+------------------------------------------------------------------------------------------------+
|                                                                            {"_1":32.0,"_2":3...|
|                                                                            {"_1":3.6,"_2":2....|
|                                                                            {"_1":4.5,"_2":5....|
+------------------------------------------------------------------------------------------------+

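To parse the JSON string back into a struct column, pass from_json an explicit schema:

import org.apache.spark.sql.types.{DataTypes, StructField, StructType}

// Keep the JSON string in a column named "final".
val new_df = df.select(to_json(struct($"_1", $"_2", $"_3")).as("final"))

// decimal(3, 1) holds values up to 99.9; decimal(2, 1) tops out at 9.9 and cannot hold 32.0.
val decimalType = DataTypes.createDecimalType(3, 1)

val schema = StructType(Seq(StructField("_1", decimalType, true), StructField("_2", decimalType, true), StructField("_3", decimalType, true)))

new_df.withColumn("final_array", from_json($"final", schema)).show(10)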

Hope this was useful.

scala> df.show(false)
+----------------------+---------+----------------------+------+
|BaseTime              |SGNL_NAME|SGNL_TIME             |SGNL_V|
+----------------------+---------+----------------------+------+
|2019-11-2118:19:15.817|Acc      |2019-11-2118:18:16.645|0.0   |
|2019-11-2118:19:15.817|Acc      |2019-11-2118:18:16.645|0.0   |
|2019-11-2118:19:15.817|Acc      |2019-11-2118:18:16.645|0.0   |
|2019-11-2118:19:15.817|Acc      |2019-11-2118:18:17.645|0.0   |
|2019-11-2118:19:15.817|Acc      |2019-11-2118:18:17.645|0.0   |
+----------------------+---------+----------------------+------+


scala> val df1 =  df.withColumn("SGNL_NAME", regexp_replace(regexp_replace(to_json(struct("SGNL_NAME")), "\\{", ""),"\\}", ""))
                    .withColumn("SGNL_TIME", regexp_replace(regexp_replace(to_json(struct("SGNL_TIME")), "\\{", ""),"\\}", ""))
                    .withColumn("SGNL_V", regexp_replace(regexp_replace(to_json(struct("SGNL_V")), "\\{", ""),"\\}", ""))


scala> df1.show(false)
+----------------------+-----------------+------------------------------------+--------------+
|BaseTime              |SGNL_NAME        |SGNL_TIME                           |SGNL_V        |
+----------------------+-----------------+------------------------------------+--------------+
|2019-11-2118:19:15.817|"SGNL_NAME":"Acc"|"SGNL_TIME":"2019-11-2118:18:16.645"|"SGNL_V":"0.0"|
|2019-11-2118:19:15.817|"SGNL_NAME":"Acc"|"SGNL_TIME":"2019-11-2118:18:16.645"|"SGNL_V":"0.0"|
|2019-11-2118:19:15.817|"SGNL_NAME":"Acc"|"SGNL_TIME":"2019-11-2118:18:16.645"|"SGNL_V":"0.0"|
|2019-11-2118:19:15.817|"SGNL_NAME":"Acc"|"SGNL_TIME":"2019-11-2118:18:17.645"|"SGNL_V":"0.0"|
|2019-11-2118:19:15.817|"SGNL_NAME":"Acc"|"SGNL_TIME":"2019-11-2118:18:17.645"|"SGNL_V":"0.0"|
+----------------------+-----------------+------------------------------------+--------------+


scala> val df2 = df1.withColumn("SGNL", struct("SGNL_NAME", "SGNL_TIME", "SGNL_V"))
                     .drop("SGNL_NAME","SGNL_TIME","SGNL_V")

scala> df2.show(false)
+----------------------+-------------------------------------------------------------------------+
|BaseTime              |SGNL                                                                     |
+----------------------+-------------------------------------------------------------------------+
|2019-11-2118:19:15.817|["SGNL_NAME":"Acc", "SGNL_TIME":"2019-11-2118:18:16.645", "SGNL_V":"0.0"]|
|2019-11-2118:19:15.817|["SGNL_NAME":"Acc", "SGNL_TIME":"2019-11-2118:18:16.645", "SGNL_V":"0.0"]|
|2019-11-2118:19:15.817|["SGNL_NAME":"Acc", "SGNL_TIME":"2019-11-2118:18:16.645", "SGNL_V":"0.0"]|
|2019-11-2118:19:15.817|["SGNL_NAME":"Acc", "SGNL_TIME":"2019-11-2118:18:17.645", "SGNL_V":"0.0"]|
|2019-11-2118:19:15.817|["SGNL_NAME":"Acc", "SGNL_TIME":"2019-11-2118:18:17.645", "SGNL_V":"0.0"]|
+----------------------+-------------------------------------------------------------------------+


scala> df2.printSchema
root
 |-- BaseTime: string (nullable = true)
 |-- SGNL: struct (nullable = false)
 |    |-- SGNL_NAME: string (nullable = true)
 |    |-- SGNL_TIME: string (nullable = true)
 |    |-- SGNL_V: string (nullable = true)
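
For comparison, here is a minimal sketch that skips the regexp_replace step and keeps SGNL as a typed array<struct> column; Dataset.toJSON then renders each row as one JSON document. The local SparkSession, the app name and the one-row sample (with every column read as a string) are assumptions for illustration and are not part of the answer above.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{array, col, struct}

// Assumption: a local session for experimenting; in spark-shell `spark` already exists.
val spark = SparkSession.builder().master("local[1]").appName("sgnl-struct").getOrCreate()
import spark.implicits._

// Assumption: one sample row with all columns as strings.
val df = Seq(
  ("2019-11-21 18:19:15.817", "Acc", "2019-11-21 18:18:16.645", "0.0")
).toDF("BaseTime", "SGNL_NAME", "SGNL_TIME", "SGNL_V")

// array(struct(...)) yields a column of type array<struct<...>>,
// so the field names and values stay structured instead of being string fragments.
val nested = df.select(
  col("BaseTime"),
  array(struct(col("SGNL_NAME"), col("SGNL_TIME"), col("SGNL_V"))).as("SGNL")
)

nested.printSchema()

// toJSON serializes each row, including the nested array, to a JSON string.
val rows = nested.toJSON.collect()
rows.foreach(println)
```

Keeping the column structured means it can also be written out directly with nested.write.json(...), without building the JSON text by hand.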

An alternative to UDFs is to use the functions in the org.apache.spark.sql.functions package, such as to_json(), struct() and array(). Here's a full working example:

val df = sc.parallelize(Seq(
  ("2019-11-21 18:19:15.817", "Acc", "2019-11-21 18:18:16.645", 0.0)
)).toDF("BaseTime", "SGNL_NAME", "SGNL_TIME", "SGNL_V")

val result = df.withColumn("SGNL", to_json(
  array(
    struct("SGNL_NAME", "SGNL_TIME", "SGNL_V")
  )
)).drop("SGNL_NAME","SGNL_TIME","SGNL_V")

result.show(false) gives a result close to the expected one (SGNL_TIME stays a formatted string rather than epoch milliseconds):

+-----------------------+------------------------------------------------------------------------+
|BaseTime               |SGNL                                                                    |
+-----------------------+------------------------------------------------------------------------+
|2019-11-21 18:19:15.817|[{"SGNL_NAME":"Acc","SGNL_TIME":"2019-11-21 18:18:16.645","SGNL_V":0.0}]|
+-----------------------+------------------------------------------------------------------------+
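
If SGNL_TIME should come out as epoch milliseconds and SGNL_V as a number, as in the question's expected output, one option is to cast the fields inside the struct. The following is only a sketch under assumptions: the local SparkSession, the app name and the sample row are made up for illustration, the input types follow the schema in the question (timestamp columns, SGNL_V as string), and the millisecond value depends on the session time zone.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{array, col, round, struct, to_json}

// Assumption: a local session for experimenting; in spark-shell `spark` already exists.
val spark = SparkSession.builder().master("local[1]").appName("sgnl-json").getOrCreate()
import spark.implicits._

// Assumption: one sample row, cast to match the schema given in the question.
val raw = Seq(
  ("2019-11-21 18:19:15.817", "Acc", "2019-11-21 18:18:16.645", "0.0")
).toDF("BaseTime", "SGNL_NAME", "SGNL_TIME", "SGNL_V")

val typed = raw.select(
  col("BaseTime").cast("timestamp").as("BaseTime"),
  col("SGNL_NAME"),
  col("SGNL_TIME").cast("timestamp").as("SGNL_TIME"),
  col("SGNL_V")
)

val result = typed.select(
  col("BaseTime"),
  to_json(array(struct(
    col("SGNL_NAME"),
    // A timestamp cast to double is fractional seconds since the epoch;
    // scale to milliseconds and round before converting to long.
    round(col("SGNL_TIME").cast("double") * 1000).cast("long").as("SGNL_TIME"),
    // SGNL_V is a string in the question's schema, so cast it to get a JSON number.
    col("SGNL_V").cast("double").as("SGNL_V")
  ))).as("SGNL")
)

result.show(false)
```

Rounding before the cast to long avoids an off-by-one millisecond when the floating-point product lands slightly below a whole number.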
