Spark DataFrame from a different data format
I have this data set, for which I need to create a Spark DataFrame in Scala. The data is a single column in a CSV file; the column name is dataheader.

dataheader
"{""date_time"":""1999/05/22 03:03:07.011"",""cust_id"":""cust1"",""timestamp"":944248234000,""msgId"":""113"",""activityTimeWindowMilliseconds"":20000,""ec"":""event1"",""name"":""ABC"",""entityId"":""1001"",""et"":""StateChange"",""logType"":""type123,""lastActivityTS"":944248834000,""sc_id"":""abc1d1c9"",""activityDetectedInLastTimeWindow"":true}"
"{""date_time"":""1999/05/23 03:03:07.011"",""cust_id"":""cust1"",""timestamp"":944248234000,""msgId"":""114"",""activityTimeWindowMilliseconds"":20000,""ec"":""event2"",""name"":""ABC"",""entityId"":""1001"",""et"":""StateChange"",""logType"":""type123,""lastActivityTS"":944248834000,""sc_id"":""abc1d1c9"",""activityDetectedInLastTimeWindow"":true}"
I was able to read the CSV file:
val df_tmp = spark
.read
.format("com.databricks.spark.csv")
.option("header","true")
.option("quoteMode", "ALL")
.option("delimiter", ",")
.option("escape", "\"")
//.option("inferSchema","true")
.option("multiline", "true")
.load("D:\\dataFile.csv")
I tried to split the data into separate columns in a DataFrame but did not succeed.
One thing I noticed in the data is that both keys and values are enclosed in doubled double quotes: ""key1"":""value1""
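As an aside, that doubled quoting is the standard CSV convention for embedding a literal " inside a quoted field, which is what the escape option in the reader above handles. A minimal sketch of what the reader effectively does to one cell (the sample cell and the CsvQuoteDemo name are illustrative, not from the question):

```scala
object CsvQuoteDemo {
  // Decode one quoted CSV cell: drop the outer quotes, then
  // collapse each "" pair back to a single literal " character.
  def decode(rawCell: String): String =
    rawCell.stripPrefix("\"").stripSuffix("\"").replace("\"\"", "\"")

  def main(args: Array[String]): Unit = {
    // A trimmed-down cell in the same shape as the question's data.
    val rawCell = "\"{\"\"msgId\"\":\"\"113\"\",\"\"ec\"\":\"\"event1\"\"}\""
    println(decode(rawCell)) // plain JSON: {"msgId":"113","ec":"event1"}
  }
}
```

So after a correctly configured read, each cell should arrive as a plain JSON string, ready to be parsed.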
If you want to get at the fields inside the dataheader column, you need to parse it and, if required, write the result out to a new CSV file. The value is clearly a JSON-formatted string.
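One way to do that parsing without leaving Spark is from_json: declare a schema for the fields you care about and flatten the struct. A sketch, assuming Spark is on the classpath (the ParseDataheader name, the trimmed field list, and the inline sample rows are my assumptions, not from the question):

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types.{BooleanType, LongType, StringType, StructField, StructType}

object ParseDataheader {
  // Declare only the fields you need; from_json leaves absent ones null.
  val schema: StructType = StructType(Seq(
    StructField("date_time", StringType),
    StructField("cust_id", StringType),
    StructField("timestamp", LongType),
    StructField("msgId", StringType),
    StructField("ec", StringType),
    StructField("activityDetectedInLastTimeWindow", BooleanType)
  ))

  // Parse the JSON string column and promote its fields to top-level columns.
  def parse(df: DataFrame): DataFrame =
    df.withColumn("parsed", from_json(col("dataheader"), schema))
      .select("parsed.*")

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("parse-dataheader").getOrCreate()
    import spark.implicits._

    // Two rows as they look after the CSV reader has unescaped the "" quoting.
    val df_tmp = Seq(
      """{"date_time":"1999/05/22 03:03:07.011","cust_id":"cust1","timestamp":944248234000,"msgId":"113","ec":"event1","activityDetectedInLastTimeWindow":true}""",
      """{"date_time":"1999/05/23 03:03:07.011","cust_id":"cust1","timestamp":944248234000,"msgId":"114","ec":"event2","activityDetectedInLastTimeWindow":true}"""
    ).toDF("dataheader")

    parse(df_tmp).show(truncate = false)
    spark.stop()
  }
}
```

On recent Spark versions you could instead infer the schema from one sample string with schema_of_json, but an explicit schema keeps the result stable if the JSON ever gains fields.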