
Apache Spark XML into JavaRDD

I have tried to read an XML file with Spark and turn it into a JavaRDD. I have read about how to turn it into a Dataset, but I want to know whether it is possible with a JavaRDD. I should mention that my XML file contains a list which is not always the same size. Here is an example of my XML file.

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<logs>
    <log>
        <id>1</id>
        <clientId>1</clientId>
        <date>Wed Apr 03 21:16:18 EEST 2019</date>
        <itemList>
            <item>2</item>
        </itemList>
    </log>
    <log>
        <id>2</id>
        <clientId>2</clientId>
        <date>Wed Apr 03 21:16:19 EEST 2019</date>
        <itemList>
            <item>1</item>
            <item>2</item>
            <item>3</item>
        </itemList>
    </log>
</logs>

Thanks!

Here is a possible solution: https://github.com/databricks/spark-xml/issues/213

Here is what you need (Scala; in spark-shell, sc and spark are already defined):

import com.databricks.spark.xml.XmlReader

val rdd = sc.parallelize(Seq("<books><book>book1</book><book>book2</book></books>"))
val df = new XmlReader().xmlRdd(spark.sqlContext, rdd)
df.show

+--------------+
|          book|
+--------------+
|[book1, book2]|
+--------------+

df.printSchema

root
 |-- book: array (nullable = true)
 |    |-- element: string (containsNull = true)

Going from the DataFrame (or an RDD) to a JavaRDD is fairly simple: call df.toJavaRDD() on the DataFrame, or toJavaRDD() on any RDD (see the JavaRDD documentation).
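If adding the spark-xml dependency is not an option, each &lt;log&gt; record can also be parsed with the JDK's built-in DOM parser and the resulting objects distributed with JavaSparkContext.parallelize. Below is a minimal sketch of the parsing step alone (no Spark dependency; the LogXmlParser and Log class names and fields are illustrative, modeled on the XML in the question):

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class LogXmlParser {

    // Value holder for one <log> record (name and fields are illustrative).
    static class Log {
        long id;
        long clientId;
        String date;
        List<String> items = new ArrayList<>();
    }

    // Parse a whole <logs> document into Log objects using the JDK DOM parser.
    static List<Log> parse(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            List<Log> result = new ArrayList<>();
            NodeList logs = doc.getElementsByTagName("log");
            for (int i = 0; i < logs.getLength(); i++) {
                Element e = (Element) logs.item(i);
                Log log = new Log();
                log.id = Long.parseLong(text(e, "id"));
                log.clientId = Long.parseLong(text(e, "clientId"));
                log.date = text(e, "date");
                // itemList can have any number of <item> children.
                NodeList items = e.getElementsByTagName("item");
                for (int j = 0; j < items.getLength(); j++) {
                    log.items.add(items.item(j).getTextContent());
                }
                result.add(log);
            }
            return result;
        } catch (Exception ex) {
            throw new RuntimeException("failed to parse XML", ex);
        }
    }

    // Text content of the first child element with the given tag name.
    private static String text(Element parent, String tag) {
        return parent.getElementsByTagName(tag).item(0).getTextContent();
    }

    public static void main(String[] args) {
        String xml = "<?xml version=\"1.0\"?><logs>"
                + "<log><id>1</id><clientId>1</clientId><date>d1</date>"
                + "<itemList><item>2</item></itemList></log>"
                + "<log><id>2</id><clientId>2</clientId><date>d2</date>"
                + "<itemList><item>1</item><item>2</item><item>3</item></itemList></log>"
                + "</logs>";
        List<Log> logs = parse(xml);
        System.out.println(logs.size());       // prints 2
        System.out.println(logs.get(1).items); // prints [1, 2, 3]
    }
}
```

Because the itemList length varies, each Log simply carries a List of whatever items are present, so records of different sizes are not a problem. The parsed list can then be handed to JavaSparkContext.parallelize to obtain a JavaRDD&lt;Log&gt;.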

I hope this answers your question.
