
Converting XML string to Spark Dataframe in Databricks

How can I build a Spark DataFrame from a string that contains XML code?

I can easily do it if the XML is saved in a file:

dfXml = (sqlContext.read.format("xml")
           .options(rowTag='my_row_tag')
           .load(xml_file_name))

However, as mentioned, I have to build the DataFrame from a string that contains regular XML.

Thank you

Mauro

You can parse an XML string without the spark-xml connector. Using the UDF below, you can convert the XML string into JSON and then do your transformations on that.

I have taken a sample XML string and stored it in a catalog.xml file.

/tmp> cat catalog.xml
<?xml version="1.0"?><catalog><book id="bk101"><author>Gambardella, Matthew</author><title>XML Developer's Guide</title><genre>Computer</genre><price>44.95</price><publish_date>2000-10-01</publish_date><description>An in-depth look at creating applications with XML.</description></book></catalog>
<?xml version="1.0"?><catalog><book id="bk102"><author>Ralls, Kim</author><title>Midnight Rain</title><genre>Fantasy</genre><price>5.95</price><publish_date>2000-12-16</publish_date><description>A former architect battles corporate zombies, an evil sorceress, and her own childhood to become queen of the world.</description></book></catalog>


Please note that the code below is in Scala; it should help you implement the same logic in Python (a rough PySpark sketch of the same approach follows the Scala transcript).

scala> val df = spark.read.textFile("/tmp/catalog.xml")
df: org.apache.spark.sql.Dataset[String] = [value: string]

scala> import org.json4s.Xml.toJson
import org.json4s.Xml.toJson

scala> import org.json4s.jackson.JsonMethods.{compact, parse}
import org.json4s.jackson.JsonMethods.{compact, parse}

scala> :paste
// Entering paste mode (ctrl-D to finish)

implicit class XmlToJson(data: String) {
    def json(root: String) = compact {
      toJson(scala.xml.XML.loadString(data)).transformField {
        case (field,value) => (field.toLowerCase,value)
      } \ root.toLowerCase
    }
    def json = compact(parse(data))
  }

val parseUDF = udf { (data: String,xmlRoot: String) => data.json(xmlRoot.toLowerCase)}


// Exiting paste mode, now interpreting.

defined class XmlToJson
parseUDF: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function2>,StringType,Some(List(StringType, StringType)))

scala> val json = df.withColumn("value",parseUDF($"value",lit("catalog")))
json: org.apache.spark.sql.DataFrame = [value: string]

scala> val json = df.withColumn("value",parseUDF($"value",lit("catalog"))).select("value").map(_.getString(0))
json: org.apache.spark.sql.Dataset[String] = [value: string]

scala> val bookDF = spark.read.json(json).select("book.*")
bookDF: org.apache.spark.sql.DataFrame = [author: string, description: string ... 5 more fields]

scala> bookDF.printSchema
root
 |-- author: string (nullable = true)
 |-- description: string (nullable = true)
 |-- genre: string (nullable = true)
 |-- id: string (nullable = true)
 |-- price: string (nullable = true)
 |-- publish_date: string (nullable = true)
 |-- title: string (nullable = true)


scala> bookDF.show(false)
+--------------------+--------------------------------------------------------------------------------------------------------------------+--------+-----+-----+------------+---------------------+
|author              |description                                                                                                         |genre   |id   |price|publish_date|title                |
+--------------------+--------------------------------------------------------------------------------------------------------------------+--------+-----+-----+------------+---------------------+
|Gambardella, Matthew|An in-depth look at creating applications with XML.                                                                 |Computer|bk101|44.95|2000-10-01  |XML Developer's Guide|
|Ralls, Kim          |A former architect battles corporate zombies, an evil sorceress, and her own childhood to become queen of the world.|Fantasy |bk102|5.95 |2000-12-16  |Midnight Rain        |
+--------------------+--------------------------------------------------------------------------------------------------------------------+--------+-----+-----+------------+---------------------+
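
For reference, here is a rough PySpark sketch of the same approach (convert each XML string to a JSON string with a UDF, then read the JSON back with spark.read.json). It assumes the third-party xmltodict package is installed on the cluster and uses the sample /tmp/catalog.xml above; adapt the column names and paths to your own data.

    # Rough sketch: XML string -> JSON string -> DataFrame, in PySpark.
    # Assumes the third-party `xmltodict` package is available on the cluster.
    import json
    import xmltodict

    from pyspark.sql.functions import col, udf
    from pyspark.sql.types import StringType

    @udf(returnType=StringType())
    def xml_to_json(xml_string):
        # Parse one XML document and serialize the resulting dict as JSON.
        return json.dumps(xmltodict.parse(xml_string))

    # One XML document per line, as in the sample catalog.xml above.
    df = spark.read.text("/tmp/catalog.xml")

    json_strings = df.select(xml_to_json(col("value")).alias("value")) \
                     .rdd.map(lambda row: row["value"])

    # xmltodict keeps the root element as the top-level key and prefixes XML
    # attributes with "@", so the book id appears as a column named "@id".
    bookDF = spark.read.json(json_strings).select("catalog.book.*")
    bookDF.show(truncate=False)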

In Scala, the XmlReader class can be used to convert an RDD[String] to a DataFrame:

    val result = new XmlReader().xmlRdd(spark, rdd)

If you have a DataFrame as input, it can easily be converted to an RDD[String].
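
For example, in PySpark (note that XmlReader itself is only available from the Scala API), assuming the DataFrame has a single string column named value holding one XML document per row, the conversion is a one-liner:

    # Turn a single-column DataFrame of XML strings into an RDD of plain strings.
    xml_rdd = df.select("value").rdd.map(lambda row: row[0])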
