I deleted two of my earlier questions because they were too big and I could not explain them neatly, so I am trying to keep it simple this time.

I have a complex nested XML that I am parsing in Spark (Scala), and I need to save all the data from the XML into text files.

NOTE: I need to save the data into text files because later I have to join this data with another file which is in text format. Also, can I join my CSV file with a JSON or Parquet file? If yes, then I may not need to convert my XML into a text file.

Below is my code, where I am trying to save the XML as CSV, but since the CSV writer does not support array-type columns, I am getting an error.

I am looking for a solution that lets me extract all elements of the array and save them into a text file.
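On the join question: once data is loaded into DataFrames, the on-disk format no longer matters, so a CSV-sourced DataFrame can be joined with a Parquet- or JSON-sourced one directly. A minimal sketch, assuming Spark 2.x's `SparkSession` (named `spark` here) and a hypothetical join key `id` and file paths:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("cross-format-join").master("local").getOrCreate()

// Each reader returns a DataFrame; the source format is irrelevant afterwards.
val csvDf = spark.read
  .option("header", "true")
  .option("delimiter", "\t")
  .csv("path/to/text_file")           // hypothetical path

val parquetDf = spark.read.parquet("path/to/parquet_file") // hypothetical path
val jsonDf    = spark.read.json("path/to/json_file")       // hypothetical path

// Joins work across DataFrames regardless of where each one was loaded from.
val joined = csvDf
  .join(parquetDf, Seq("id"))  // "id" is an assumed common key
  .join(jsonDf, Seq("id"))
```

So if the only reason for converting to text is the join, the conversion may be unnecessary.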
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.functions.explode

def main(args: Array[String]): Unit = {
  val conf = new SparkConf().setAppName("XML").setMaster("local")
  val sc = new SparkContext(conf) // Creating the Spark context
  val sqlContext = new org.apache.spark.sql.SQLContext(sc)

  // Read the XML, treating each <env:Body> element as one row
  val df = sqlContext.read
    .format("com.databricks.spark.xml")
    .option("rowTag", "env:Body")
    .load("C://Users//u6034690//Desktop//SPARK//trfsmallfffile//XML")

  // Explode the array of content items into one row per item
  val resDf = df.withColumn("FlatType", explode(df("env:ContentItem"))).select("FlatType.*")

  resDf.repartition(1).write
    .format("csv") // This does not support array-type columns
    .option("timestampFormat", "yyyy/MM/dd HH:mm:ss ZZ")
    .option("nullValue", "")
    .option("delimiter", "\t")
    .option("quote", "\u0000")
    .option("header", "true")
    .save("C://Users//u6034690//Desktop//SPARK//trfsmallfffile//XML//output")

  // val resDf = df.withColumn("FlatType", when(df("env:ContentItem").isNotNull, explode(df("env:ContentItem"))))
}
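Since `env:Data` is a nested struct, one way to make the frame CSV-friendly is to keep flattening struct columns until only atomic types remain. A generic sketch (assuming the remaining nesting is structs, not arrays, after the `explode` above):

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.StructType

// Recursively expand every struct column into top-level columns,
// renaming "parent.child" to "parent_child" to keep names unique.
def flattenStructs(df: DataFrame): DataFrame = {
  val hasStruct = df.schema.fields.exists(_.dataType.isInstanceOf[StructType])
  if (!hasStruct) df
  else {
    val cols = df.schema.fields.flatMap { f =>
      f.dataType match {
        case st: StructType =>
          st.fieldNames.map(n => col(s"`${f.name}`.`$n`").alias(s"${f.name}_$n"))
        case _ =>
          Seq(col(s"`${f.name}`"))
      }
    }
    flattenStructs(df.select(cols: _*))
  }
}
```

Applied as `flattenStructs(resDf)`, this would turn the `env:Data` struct into plain columns that the CSV/text writers can handle.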
This produces the output below before saving:
+---------+--------------------+
| _action| env:Data|
+---------+--------------------+
| Insert|[fun:FundamentalD...|
|Overwrite|[sr:FinancialSour...|
|Overwrite|[pe:FinancialPeri...|
|Overwrite|[pe:FinancialPeri...|
|Overwrite|[pe:FinancialPeri...|
|Overwrite|[pe:FinancialPeri...|
|Overwrite|[pe:FinancialPeri...|
|Overwrite|[pe:FinancialPeri...|
|Overwrite|[pe:FinancialPeri...|
|Overwrite|[pe:FinancialPeri...|
|Overwrite|[pe:FinancialPeri...|
|Overwrite|[pe:FinancialPeri...|
|Overwrite|[fl:FinancialLine...|
|Overwrite|[fl:FinancialLine...|
|Overwrite|[fl:FinancialLine...|
|Overwrite|[fl:FinancialLine...|
|Overwrite|[fl:FinancialLine...|
|Overwrite|[fl:FinancialLine...|
|Overwrite|[fl:FinancialLine...|
|Overwrite|[fl:FinancialLine...|
+---------+--------------------+
For each unique env:Data I am expecting a separate file, which I believe can be done using partitioning, but how can I save it as a text file?
I have to save all the elements of the array, i.e. all columns.
I hope this time my question is clear.
If required, I can update the schema as well.
Spark SQL has a direct write-to-CSV option. Why not use that?
Here is the syntax:
resDf.write.option("your options").csv("output file path")
This should save your file directly in CSV format.
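For the array-type error specifically, any remaining array columns can be converted to delimited strings before writing. A sketch, assuming the arrays hold atomic values (`concat_ws` does not handle arrays of structs):

```scala
import org.apache.spark.sql.functions.{col, concat_ws}
import org.apache.spark.sql.types.ArrayType

// Replace every array column with a pipe-delimited string so the
// CSV writer accepts it (CSV cannot serialize array types directly).
val stringified = resDf.schema.fields.foldLeft(resDf) { (acc, f) =>
  f.dataType match {
    case _: ArrayType => acc.withColumn(f.name, concat_ws("|", col(s"`${f.name}`")))
    case _            => acc
  }
}

stringified.write
  .option("delimiter", "\t")
  .option("header", "true")
  .csv("output/path") // hypothetical path
```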