
How to save array data frame output from spark xml in csv format

I deleted two of my previous questions because they were too long and I could not explain them clearly.

So I am trying to keep it simple this time.

I have a complex nested XML file. I am parsing it with Spark in Scala, and I need to save all of the data from the XML into text files.

NOTE: I need to save the data as text files because later I have to join this data with another file that is in text format. Alternatively, can I join my CSV file with a JSON or Parquet file? If so, I may not need to convert my XML into text files at all.
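(On the join question: Spark reads every supported format into the same DataFrame abstraction, so once the files are loaded, a CSV-backed DataFrame can be joined with a JSON- or Parquet-backed one directly. A minimal sketch, assuming hypothetical file paths and a shared `id` column:)

```scala
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext

val conf = new SparkConf().setAppName("JoinFormats").setMaster("local[*]")
val sc = new SparkContext(conf)
val sqlContext = new org.apache.spark.sql.SQLContext(sc)

// Hypothetical paths and join key, for illustration only
val csvDf     = sqlContext.read.option("header", "true").csv("data/left.csv")
val parquetDf = sqlContext.read.parquet("data/right.parquet")

// Once loaded, the source format no longer matters: both are DataFrames
val joined = csvDf.join(parquetDf, Seq("id"), "inner")
joined.show()
```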

This is my code, where I try to save the XML as a CSV file, but I get an error because CSV does not support array-type columns.

I am looking for a solution that extracts all elements of the array and saves them to a text file.

def main(args: Array[String]) {

    val conf = new SparkConf().setAppName("XML").setMaster("local")
    val sc = new SparkContext(conf) // Create the Spark context
    val sqlContext = new org.apache.spark.sql.SQLContext(sc)

    val df = sqlContext.read
      .format("com.databricks.spark.xml")
      .option("rowTag", "env:Body")
      .load("C://Users//u6034690//Desktop//SPARK//trfsmallfffile//XML")

    val resDf = df.withColumn("FlatType", explode(df("env:ContentItem"))).select("FlatType.*")

    resDf.repartition(1).write
      .format("csv") // This does not support ArrayType columns
      .option("timestampFormat", "yyyy/MM/dd HH:mm:ss ZZ")
      .option("nullValue", "")
      .option("delimiter", "\t")
      .option("quote", "\u0000")
      .option("header", "true")
      .save("C://Users//u6034690//Desktop//SPARK//trfsmallfffile//XML//output")

    // val resDf = df.withColumn("FlatType", when(df("env:ContentItem").isNotNull, explode(df("env:ContentItem"))))
  }

This produces the following output before saving:

+---------+--------------------+
|  _action|            env:Data|
+---------+--------------------+
|   Insert|[fun:FundamentalD...|
|Overwrite|[sr:FinancialSour...|
|Overwrite|[pe:FinancialPeri...|
|Overwrite|[pe:FinancialPeri...|
|Overwrite|[pe:FinancialPeri...|
|Overwrite|[pe:FinancialPeri...|
|Overwrite|[pe:FinancialPeri...|
|Overwrite|[pe:FinancialPeri...|
|Overwrite|[pe:FinancialPeri...|
|Overwrite|[pe:FinancialPeri...|
|Overwrite|[pe:FinancialPeri...|
|Overwrite|[pe:FinancialPeri...|
|Overwrite|[fl:FinancialLine...|
|Overwrite|[fl:FinancialLine...|
|Overwrite|[fl:FinancialLine...|
|Overwrite|[fl:FinancialLine...|
|Overwrite|[fl:FinancialLine...|
|Overwrite|[fl:FinancialLine...|
|Overwrite|[fl:FinancialLine...|
|Overwrite|[fl:FinancialLine...|
+---------+--------------------+

For each unique env:Data I expect a separate file, which could be done with partitioning, but how can I save it as a text file?

I have to save all the elements from the array, i.e. all of the columns.

I hope my question is clear this time.

If required, I can also update the schema.

Spark SQL has a direct write-to-CSV option. Why not use that?

Here is the syntax:

resDf.write.option("your options").csv("output file path")

This should save your file directly in CSV format.
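That alone will still fail on array columns, though, since the CSV writer rejects ArrayType and StructType. One common workaround (a sketch, not a definitive fix, assuming the array elements are structs that `to_json` can serialize, which requires Spark 2.2+ for arrays of structs) is to convert each complex column to a JSON string before writing:

```scala
import org.apache.spark.sql.functions.{col, to_json}
import org.apache.spark.sql.types.{ArrayType, StructType}

// Convert every ArrayType/StructType column in resDf to a JSON string,
// so the resulting DataFrame contains only CSV-compatible scalar columns
val flattened = resDf.schema.fields.foldLeft(resDf) { (df, field) =>
  field.dataType match {
    case _: ArrayType | _: StructType =>
      df.withColumn(field.name, to_json(col(field.name)))
    case _ => df
  }
}

flattened.repartition(1).write
  .option("header", "true")
  .option("delimiter", "\t")
  .csv("C://Users//u6034690//Desktop//SPARK//trfsmallfffile//XML//output")
```

The JSON strings can later be parsed back with `from_json` if the structure is needed again, and the tab-delimited output joins cleanly against other text files.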
