Spark DataFrame from a different data format
I have this data set, for which I need to create a Spark DataFrame in Scala. The data is a single column in a CSV file; the column name is dataheader.

dataheader
"{""date_time"":""1999/05/22 03:03:07.011"",""cust_id"":""cust1"",""timestamp"":944248234000,""msgId"":""113"",""activityTimeWindowMilliseconds"":20000,""ec"":""event1"",""name"":""ABC"",""entityId"":""1001"",""et"":""StateChange"",""logType"":""type123,""lastActivityTS"":944248834000,""sc_id"":""abc1d1c9"",""activityDetectedInLastTimeWindow"":true}"
"{""date_time"":""1999/05/23 03:03:07.011"",""cust_id"":""cust1"",""timestamp"":944248234000,""msgId"":""114"",""activityTimeWindowMilliseconds"":20000,""ec"":""event2"",""name"":""ABC"",""entityId"":""1001"",""et"":""StateChange"",""logType"":""type123,""lastActivityTS"":944248834000,""sc_id"":""abc1d1c9"",""activityDetectedInLastTimeWindow"":true}"
I was able to read the CSV file:
val df_tmp = spark
.read
.format("com.databricks.spark.csv")
.option("header","true")
.option("quoteMode", "ALL")
.option("delimiter", ",")
.option("escape", "\"")
//.option("inferSchema","true")
.option("multiline", "true")
.load("D:\\dataFile.csv")
I tried to split the data into separate columns in a DataFrame but did not succeed.
One thing I noticed in the data is that both keys and values are enclosed in doubled double quotes: ""key1"":""value1""
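As an aside, that doubled quoting is the standard CSV convention for embedding a literal " inside a quoted field, which is what the escape option in the reader above handles. A minimal sketch of what the reader effectively does to one cell (the sample cell and the CsvQuoteDemo name are illustrative, not from the question):

```scala
object CsvQuoteDemo {
  // Decode one quoted CSV cell: drop the outer quotes, then
  // collapse each "" pair back to a single literal " character.
  def decode(rawCell: String): String =
    rawCell.stripPrefix("\"").stripSuffix("\"").replace("\"\"", "\"")

  def main(args: Array[String]): Unit = {
    // A trimmed-down cell in the same shape as the question's data.
    val rawCell = "\"{\"\"msgId\"\":\"\"113\"\",\"\"ec\"\":\"\"event1\"\"}\""
    println(decode(rawCell)) // plain JSON: {"msgId":"113","ec":"event1"}
  }
}
```

So after a correctly configured read, each cell should arrive as a plain JSON string, ready to be parsed.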
If you want to get at the fields inside the dataheader column, you need to parse it and, if required, write the result out to a new CSV file. The value is clearly a JSON-formatted string.
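One way to do that parsing without leaving Spark is from_json: declare a schema for the fields you care about and flatten the struct. A sketch, assuming Spark is on the classpath (the ParseDataheader name, the trimmed field list, and the inline sample rows are my assumptions, not from the question):

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types.{BooleanType, LongType, StringType, StructField, StructType}

object ParseDataheader {
  // Declare only the fields you need; from_json leaves absent ones null.
  val schema: StructType = StructType(Seq(
    StructField("date_time", StringType),
    StructField("cust_id", StringType),
    StructField("timestamp", LongType),
    StructField("msgId", StringType),
    StructField("ec", StringType),
    StructField("activityDetectedInLastTimeWindow", BooleanType)
  ))

  // Parse the JSON string column and promote its fields to top-level columns.
  def parse(df: DataFrame): DataFrame =
    df.withColumn("parsed", from_json(col("dataheader"), schema))
      .select("parsed.*")

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("parse-dataheader").getOrCreate()
    import spark.implicits._

    // Two rows as they look after the CSV reader has unescaped the "" quoting.
    val df_tmp = Seq(
      """{"date_time":"1999/05/22 03:03:07.011","cust_id":"cust1","timestamp":944248234000,"msgId":"113","ec":"event1","activityDetectedInLastTimeWindow":true}""",
      """{"date_time":"1999/05/23 03:03:07.011","cust_id":"cust1","timestamp":944248234000,"msgId":"114","ec":"event2","activityDetectedInLastTimeWindow":true}"""
    ).toDF("dataheader")

    parse(df_tmp).show(truncate = false)
    spark.stop()
  }
}
```

On recent Spark versions you could instead infer the schema from one sample string with schema_of_json, but an explicit schema keeps the result stable if the JSON ever gains fields.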