Convert multiple columns into a single column in a DataFrame

I have a scenario where I need to combine the data from several columns into a single column.

Below is the data available.

|BaseTime               |SGNL_NAME |SGNL_TIME              |SGNL_V|
|2019-11-21 18:19:15.817|Acc       |2019-11-21 18:18:16.645|0.0   |
|2019-11-21 18:19:15.817|Acc       |2019-11-21 18:18:16.645|0.0   |
|2019-11-21 18:19:15.817|Acc       |2019-11-21 18:18:16.645|0.0   |
|2019-11-21 18:19:15.817|Acc       |2019-11-21 18:18:17.645|0.0   |
|2019-11-21 18:19:15.817|Acc       |2019-11-21 18:18:17.645|0.0   |

The expected output is shown below: a new column is created that combines NAME, TIME and V into the elements of an array.

"SGNL": [
    {
        "SGNL_NAME": "Acc",
        "SGNL_TIME": 1574128316834,
        "SGNL_V": 0.0
    }
]

|BaseTime               |SGNL                                                             |
|2019-11-21 18:19:15.817|[{"SGNL_NAME": "Acc" ,"SGNL_TIME": 1574128316834,"SGNL_V": 0.0}]|
|2019-11-21 18:19:15.817|[{"SGNL_NAME": "Acc" ,"SGNL_TIME": 1574128316834,"SGNL_V": 0.0}]|
|2019-11-21 18:19:15.817|[{"SGNL_NAME": "Acc" ,"SGNL_TIME": 1574128316834,"SGNL_V": 0.0}]|
|2019-11-21 18:19:15.817|[{"SGNL_NAME": "Acc" ,"SGNL_TIME": 1574128316834,"SGNL_V": 0.0}]|
|2019-11-21 18:19:15.817|[{"SGNL_NAME": "Acc" ,"SGNL_TIME": 1574128316834,"SGNL_V": 0.0}]|

The schema of the input is given below:

 |-- BaseTime: timestamp (nullable = true)
 |-- SGNL_NAME: string (nullable = true)
 |-- SGNL_TIME: timestamp (nullable = true)
 |-- SGNL_V: string (nullable = true)

I am currently trying to write a UDF to do the combining. Is there any other solution available?

You can use to_json to convert multiple columns to JSON, as shown below:

val df = sc.parallelize(Seq(
  (32.0, 31.0, 14.0), (3.6, 2.8, 0.0), (4.5, 5.0, -1.2)
)).toDF

scala> df.show(10)
|  _1|  _2|  _3|
|32.0|31.0|14.0|
| 3.6| 2.8| 0.0|
| 4.5| 5.0|-1.2|

scala> df.select(to_json(struct($"_1", $"_2", $"_3"))).show(10)
|structstojson(named_struct(NamePlaceholder(), _1, NamePlaceholder(), _2, NamePlaceholder(), _3))|
|                                                                            {"_1":32.0,"_2":3...|
|                                                                            {"_1":3.6,"_2":2....|
|                                                                            {"_1":4.5,"_2":5....|

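To parse the JSON string back into a struct, alias the JSON column (assumed here to be named "final") and apply from_json with a matching schema:

import org.apache.spark.sql.types._

val new_df = df.select(to_json(struct($"_1", $"_2", $"_3")).as("final"))

// Precision 3, scale 1, so that values such as 32.0 fit (precision 2 is too small for them).
val decimalType = DataTypes.createDecimalType(3, 1)

val schema = StructType(Seq(StructField("_1", decimalType, true), StructField("_2", decimalType, true), StructField("_3", decimalType, true)))

new_df.withColumn("final_array", from_json($"final", schema)).show(10)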

Hope this was useful.

scala> df.show(false)
|BaseTime              |SGNL_NAME|SGNL_TIME             |SGNL_V|
|2019-11-2118:19:15.817|Acc      |2019-11-2118:18:16.645|0.0   |
|2019-11-2118:19:15.817|Acc      |2019-11-2118:18:16.645|0.0   |
|2019-11-2118:19:15.817|Acc      |2019-11-2118:18:16.645|0.0   |
|2019-11-2118:19:15.817|Acc      |2019-11-2118:18:17.645|0.0   |
|2019-11-2118:19:15.817|Acc      |2019-11-2118:18:17.645|0.0   |

scala> val df1 =  df.withColumn("SGNL_NAME", regexp_replace(regexp_replace(to_json(struct("SGNL_NAME")), "\\{", ""),"\\}", ""))
                    .withColumn("SGNL_TIME", regexp_replace(regexp_replace(to_json(struct("SGNL_TIME")), "\\{", ""),"\\}", ""))
                    .withColumn("SGNL_V", regexp_replace(regexp_replace(to_json(struct("SGNL_V")), "\\{", ""),"\\}", ""))

scala> df1.show(false)
|BaseTime              |SGNL_NAME        |SGNL_TIME                           |SGNL_V        |

scala> val df2 = df1.withColumn("SGNL", struct("SGNL_NAME", "SGNL_TIME", "SGNL_V"))

scala> df2.show(false)
|BaseTime              |SGNL                                                                     |
|2019-11-2118:19:15.817|["SGNL_NAME":"Acc", "SGNL_TIME":"2019-11-2118:18:16.645", "SGNL_V":"0.0"]|
|2019-11-2118:19:15.817|["SGNL_NAME":"Acc", "SGNL_TIME":"2019-11-2118:18:16.645", "SGNL_V":"0.0"]|
|2019-11-2118:19:15.817|["SGNL_NAME":"Acc", "SGNL_TIME":"2019-11-2118:18:16.645", "SGNL_V":"0.0"]|
|2019-11-2118:19:15.817|["SGNL_NAME":"Acc", "SGNL_TIME":"2019-11-2118:18:17.645", "SGNL_V":"0.0"]|
|2019-11-2118:19:15.817|["SGNL_NAME":"Acc", "SGNL_TIME":"2019-11-2118:18:17.645", "SGNL_V":"0.0"]|

scala> df2.printSchema
 |-- BaseTime: string (nullable = true)
 |-- SGNL: struct (nullable = false)
 |    |-- SGNL_NAME: string (nullable = true)
 |    |-- SGNL_TIME: string (nullable = true)
 |    |-- SGNL_V: string (nullable = true)
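
For comparison, here is a minimal sketch of the same idea without the regexp_replace step: struct() already keeps the field names, so the three columns can be packed into one nested column directly. The local SparkSession, the app name and the single sample row are assumptions made for illustration; in spark-shell, `spark` already exists.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.struct

// Local session for illustration only (assumption); spark-shell provides `spark`.
val spark = SparkSession.builder().master("local[1]").appName("sgnl-struct").getOrCreate()
import spark.implicits._

// One sample row, with every column held as a string.
val df = Seq(
  ("2019-11-21 18:19:15.817", "Acc", "2019-11-21 18:18:16.645", "0.0")
).toDF("BaseTime", "SGNL_NAME", "SGNL_TIME", "SGNL_V")

// select (rather than withColumn) keeps only BaseTime and the new struct column.
val nested = df.select(
  $"BaseTime",
  struct($"SGNL_NAME", $"SGNL_TIME", $"SGNL_V").as("SGNL")
)

nested.printSchema()
nested.show(false)
```

The nested fields stay addressable by name, for example nested.select("SGNL.SGNL_NAME"), which is not possible once the values have been flattened into text.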

An alternative to UDFs is to use the functions in the org.apache.spark.sql.functions package, such as to_json(), struct() and array(). Here's a full working example:

val df = sc.parallelize(Seq(
  ("2019-11-21 18:19:15.817", "Acc", "2019-11-21 18:18:16.645", 0.0)
)).toDF("BaseTime", "SGNL_NAME", "SGNL_TIME", "SGNL_V")

val result = df.withColumn("SGNL", to_json(
    array(struct("SGNL_NAME", "SGNL_TIME", "SGNL_V"))
)).drop("SGNL_NAME", "SGNL_TIME", "SGNL_V")

result.show(false) gives your expected result:

|BaseTime               |SGNL                                                                    |
|2019-11-21 18:19:15.817|[{"SGNL_NAME":"Acc","SGNL_TIME":"2019-11-21 18:18:16.645","SGNL_V":0.0}]|
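
The expected output in the question shows SGNL_TIME as an epoch-millisecond number and SGNL_V as a number rather than a string. The following is a rough sketch of one way to get there with built-in functions only, assuming a Spark version in which to_json accepts an array of structs. The local SparkSession, the app name and the sample row are assumptions for illustration, and the resulting epoch value depends on the session time zone.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{array, round, struct, to_json}

// Local session for illustration only (assumption); spark-shell provides `spark`.
val spark = SparkSession.builder().master("local[1]").appName("sgnl-json").getOrCreate()
import spark.implicits._

// Rebuild the question's schema: two timestamps, SGNL_V as a string.
val input = Seq(
  ("2019-11-21 18:19:15.817", "Acc", "2019-11-21 18:18:16.645", "0.0")
).toDF("BaseTime", "SGNL_NAME", "SGNL_TIME", "SGNL_V")
  .withColumn("BaseTime", $"BaseTime".cast("timestamp"))
  .withColumn("SGNL_TIME", $"SGNL_TIME".cast("timestamp"))

val out = input.select(
  $"BaseTime",
  to_json(array(struct(
    $"SGNL_NAME",
    // timestamp -> fractional seconds since the epoch -> whole milliseconds
    round($"SGNL_TIME".cast("double") * 1000).cast("long").as("SGNL_TIME"),
    // the string "0.0" becomes the JSON number 0.0
    $"SGNL_V".cast("double").as("SGNL_V")
  ))).as("SGNL")
)

out.show(false)
```

The round call is there because multiplying fractional seconds by 1000 in floating point can land just below the intended millisecond, and a plain cast to long would truncate it.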

