
Write records per partition in a Spark data frame to an XML file

I have to count the records in each file per partition of a Spark data frame and then write the output to an XML file.

Here is my data frame:

dfMainOutputFinalWithoutNull.coalesce(1).write
  .partitionBy("DataPartition", "StatementTypeCode")
  .format("csv")
  .option("nullValue", "")
  .option("header", "true")
  .option("codec", "gzip")
  .save("s3://trfsdisu/SPARK/FinancialLineItem/output")

Now I have to count the number of records in each file in each partition and then write the output to an XML file.

This is how I am trying to do it:

val count = dfMainOutputFinalWithoutNull
  .groupBy("DataPartition", "StatementTypeCode")
  .count()

count.write.format("com.databricks.spark.xml")
  .option("rootTag", "items")
  .option("rowTag", "item")
  .save("s3://trfsdisu/SPARK/FinancialLineItem/Descr")

I am able to print the total number of records per partition, but when I try to create the XML file I get the error below:

java.lang.ClassNotFoundException: Failed to find data source: com.databricks.spark.xml. Please find packages at http://spark.apache.org/third-party-projects.html

I am using Spark 2.2.0 and Zeppelin 0.7.2.

So do I have to import com.databricks.spark.xml? But why, given that for CSV files I do not have to import com.databricks.spark.csv?
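(A minimal sketch of what I mean, assuming a fresh session: CSV is a built-in data source in Spark 2.x, whereas spark-xml is an external package that only has to be on the classpath; no import statement is needed in either case. The session setup below is just an illustration, and spark.jars.packages only takes effect if it is set before the SparkContext is created.)

import org.apache.spark.sql.SparkSession

// Sketch only: put the external spark-xml package on the classpath via config.
// This works only for a fresh session, before any SparkContext exists.
val spark = SparkSession.builder()
  .appName("xml-write-sketch")
  .config("spark.jars.packages", "com.databricks:spark-xml_2.11:0.4.1")
  .getOrCreate()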

Also, can I cache dfMainOutputFinalWithoutNull, since I will be using it twice: first to write its data, and then to count its records per partition and write that count to the XML file?
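A quick sketch of the caching idea, reusing the names from above (perPartitionCounts is just an illustrative name):

// Sketch: cache once so the data frame is not recomputed for the second action.
dfMainOutputFinalWithoutNull.cache()

// First action: the partitioned CSV write shown earlier.
// Second action: the per-partition record count that feeds the XML output.
val perPartitionCounts = dfMainOutputFinalWithoutNull
  .groupBy("DataPartition", "StatementTypeCode")
  .count()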

And I added this dependency:

  <!-- https://mvnrepository.com/artifact/com.databricks/spark-xml_2.10 -->
<dependency>
    <groupId>com.databricks</groupId>
    <artifactId>spark-xml_2.10</artifactId>
    <version>0.2.0</version>
</dependency>

And restarted the interpreter. Then I got the following error:

java.lang.NullPointerException
    at org.apache.zeppelin.spark.Utils.invokeMethod(Utils.java:38)
    at org.apache.zeppelin.spark.Utils.invokeMethod(Utils.java:33)
    at org.apache.zeppelin.spark.SparkInterpreter.createSparkContext_2(SparkInterpreter.java:391)
    at org.apache.zeppelin.spark.SparkInterpreter.createSparkContext(SparkInterpreter.java:380)
    at org.apache.zeppelin.spark.SparkInterpreter.getSparkContext(SparkInterpreter.java:146)

I will answer my own question.

So I added the dependency below in Zeppelin:

Scala 2.11

groupId: com.databricks
artifactId: spark-xml_2.11
version: 0.4.1

Added the following in Zeppelin:

com.databricks:spark-xml_2.11:0.4.1

And then I was able to create the files.
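For completeness, a small read-back check that should work once the package is on the classpath (the path is the one from the question; descr is just an illustrative name):

val descr = spark.read
  .format("com.databricks.spark.xml")
  .option("rowTag", "item")
  .load("s3://trfsdisu/SPARK/FinancialLineItem/Descr")

descr.show()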
