Parse XML string with spark xml in Scala

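I have a DataFrame with several columns, one of which contains XML. I need to parse the XML while keeping the other columns.

The code below parses the XML as required. But how can I add the other columns?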

import com.databricks.spark.xml.XmlReader
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
// Assumes a SparkSession named `spark`, as in spark-shell (where this import is automatic).
import spark.implicits._

case class Data(id: String, code: Int, xmldata: String)
val df = Seq(
    Data("123abc", 12345,"<XML><Date Depart=\"2019-06-30\" Arrive=\"2019-06-22\" /><Passengers><Passenger Age=\"ADT\" Quantity=\"1\" /><Passenger Age=\"CHD\" Quantity=\"1\" /></Passengers><Destination Code=\"LAX\"/></XML>")).toDF

// Pull the XML strings out as an RDD[String]; id and code are dropped at this point.
val xrdd = df.select("xmldata").map(a => a.getString(0)).rdd

// Parse each string as one XML row, then explode the Passenger array into one row per passenger.
val xmldf = (new XmlReader()).xmlRdd(spark.sqlContext, xrdd)
    .select(
        $"Date._Arrive".as("Arrive"),
        $"Date._Depart".as("Depart"),
        $"Destination._Code".as("Destination"),
        explode($"Passengers.Passenger").alias("Passenger"))

// show returns Unit, so its result is not assigned to a val.
xmldf.select($"Arrive",$"Depart",$"Destination",$"Passenger._Age",$"Passenger._Quantity").show

This returns:

+----------+----------+-----------+----+---------+
|    Arrive|    Depart|Destination|_Age|_Quantity|
+----------+----------+-----------+----+---------+
|2019-06-22|2019-06-30|        LAX| ADT|        1|
|2019-06-22|2019-06-30|        LAX| CHD|        1|
+----------+----------+-----------+----+---------+

But what I want is the following (including id and code from the original DataFrame):

+----------+----------+----------+----------+-----------+----+---------+
|        id|    code  |    Arrive|    Depart|Destination|_Age|_Quantity|
+----------+----------+----------+----------+-----------+----+---------+
|123abc    |12345     |2019-06-22|2019-06-30|        LAX| ADT|        1|
|123abc    |12345     |2019-06-22|2019-06-30|        LAX| CHD|        1|
+----------+----------+----------+----------+-----------+----+---------+

Can you try cross joining the df and xmldf DataFrames?

df.crossJoin(xmldf).select($"id",$"code",$"Arrive",$"Depart",$"Destination",$"Passenger._Age",$"Passenger._Quantity").show
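
One caveat with this suggestion: a cross join pairs every row of df with every row of xmldf, so it gives the desired output only while df holds a single record. The plain-Scala sketch below models the two DataFrames as lists to show the effect. The `Source` and `Parsed` types and the `crossJoin` helper are made up for illustration and are not part of Spark.

```scala
object CrossJoinSketch {
  // Hypothetical stand-ins for one row of df and one row of xmldf.
  final case class Source(id: String, code: Int)
  final case class Parsed(arrive: String, destination: String, age: String)

  // A cross join is a Cartesian product: every left row meets every right row.
  def crossJoin(left: Seq[Source], right: Seq[Parsed]): Seq[(Source, Parsed)] =
    for (l <- left; r <- right) yield (l, r)

  def main(args: Array[String]): Unit = {
    val sources = Seq(Source("123abc", 12345), Source("345xyz", 102030))
    val parsed = Seq(
      Parsed("2019-06-22", "LAX", "ADT"),
      Parsed("2019-06-22", "LAX", "CHD"),
      Parsed("2019-07-22", "TX", "BCD"),
      Parsed("2019-07-22", "TX", "APB"))

    val joined = crossJoin(sources, parsed)
    // 2 source rows x 4 parsed rows = 8 rows, although only 4 are wanted.
    println(joined.size)

    // Rows in which an id is paired with another record's XML data.
    joined
      .filter { case (s, p) => s.id == "123abc" && p.destination == "TX" }
      .foreach(println)
  }
}
```

With a single source record the product has exactly the wanted rows, which is why the one-row example in the question looks correct.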

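Thanks,

Oops, sorry, I hadn't considered that. Could you try the commands below? The key point is that you need to select a suitable key from the XML.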

import org.apache.spark.sql.expressions.Window

case class Data(id: String, code: Int, xmldata: String)
val df = Seq(
    Data("123abc", 12345,"<XML><Date Depart=\"2019-06-30\" Arrive=\"2019-06-22\" /><Passengers><Passenger Age=\"ADT\" Quantity=\"1\" /><Passenger Age=\"CHD\" Quantity=\"1\" /></Passengers><Destination Code=\"LAX\"/></XML>"),
    Data("345xyz", 102030,"<XML><Date Depart=\"2019-07-30\" Arrive=\"2019-07-22\" /><Passengers><Passenger Age=\"BCD\" Quantity=\"2\" /><Passenger Age=\"APB\" Quantity=\"2\" /></Passengers><Destination Code=\"TX\"/></XML>"),
    Data("xxdf456", 201910,"<XML><Date Depart=\"2019-07-30\" Arrive=\"2019-07-22\" /><Passengers><Passenger Age=\"BCD\" Quantity=\"2\" /><Passenger Age=\"APB\" Quantity=\"2\" /></Passengers><Destination Code=\"TX\"/></XML>")).toDF

val xrdd = df.select("xmldata").map(x => x.getString(0)).rdd
val xmldf = (new XmlReader()).xmlRdd(spark.sqlContext, xrdd)
    .select($"Date._Arrive".as("Arrive"),$"Date._Depart".as("Depart"),$"Destination._Code".as("Destination"),explode($"Passengers.Passenger").alias("Passenger"))
    .select($"Arrive",$"Depart",$"Destination",$"Passenger._Age",$"Passenger._Quantity")

// Rank the source rows by their XML string; identical strings share a rank.
val w = Window.orderBy("xmldata")
val dfx = df.withColumn("id1", dense_rank.over(w))

// Rank the parsed rows by the fields extracted from the XML.
val w1 = Window.orderBy($"Arrive",$"Depart",$"Destination")
val xmldfx = xmldf.withColumn("id1", dense_rank.over(w1))

// Join the two sides on the rank, which serves as the key.
dfx.alias("a").join(xmldfx.alias("b"), $"a.id1" === $"b.id1")
    .select($"a.id",$"a.code",$"b.Arrive",$"b.Depart",$"b.Destination",$"b._Age",$"b._Quantity")
    .orderBy("id").show

Here the second and third rows contain identical XML, so the output will be:

+-------+------+----------+----------+-----------+----+---------+
|     id|  code|    Arrive|    Depart|Destination|_Age|_Quantity|
+-------+------+----------+----------+-----------+----+---------+
| 123abc| 12345|2019-06-22|2019-06-30|        LAX| ADT|        1|
| 123abc| 12345|2019-06-22|2019-06-30|        LAX| CHD|        1|
| 345xyz|102030|2019-07-22|2019-07-30|         TX| BCD|        2|
| 345xyz|102030|2019-07-22|2019-07-30|         TX| APB|        2|
| 345xyz|102030|2019-07-22|2019-07-30|         TX| APB|        2|
| 345xyz|102030|2019-07-22|2019-07-30|         TX| BCD|        2|
|xxdf456|201910|2019-07-22|2019-07-30|         TX| BCD|        2|
|xxdf456|201910|2019-07-22|2019-07-30|         TX| APB|        2|
|xxdf456|201910|2019-07-22|2019-07-30|         TX| BCD|        2|
|xxdf456|201910|2019-07-22|2019-07-30|         TX| APB|        2|
+-------+------+----------+----------+-----------+----+---------+
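
The ten rows above follow from how a dense rank treats ties: rows with equal keys share a rank, so both ids holding the TX record match all four TX passenger rows. The plain-Scala sketch below reproduces that count with ordinary collections. The `denseRank` and `joinOnRank` helpers and the abbreviated XML strings are illustrative assumptions, not Spark APIs.

```scala
object DenseRankSketch {
  // Dense rank: equal keys share a rank, and ranks have no gaps (1, 2, 3, ...).
  def denseRank[K: Ordering](keys: Seq[K]): Map[K, Int] =
    keys.distinct.sorted.zipWithIndex.map { case (k, i) => k -> (i + 1) }.toMap

  // Join (id, xml) rows with (arrive, age) rows wherever their dense ranks agree.
  def joinOnRank(
      left: Seq[(String, String)],
      right: Seq[(String, String)]): Seq[(String, String, String)] = {
    val leftRank = denseRank(left.map(_._2))
    val rightRank = denseRank(right.map(_._1))
    for {
      (id, xml) <- left
      (arrive, age) <- right
      if leftRank(xml) == rightRank(arrive)
    } yield (id, arrive, age)
  }

  def main(args: Array[String]): Unit = {
    // Abbreviated placeholder XML; the second and third ids share the same string.
    val left = Seq(
      "123abc" -> "<XML>06</XML>",
      "345xyz" -> "<XML>07</XML>",
      "xxdf456" -> "<XML>07</XML>")
    // One (Arrive, Age) pair per exploded passenger row.
    val right = Seq(
      "2019-06-22" -> "ADT", "2019-06-22" -> "CHD",
      "2019-07-22" -> "BCD", "2019-07-22" -> "APB",
      "2019-07-22" -> "BCD", "2019-07-22" -> "APB")

    // 123abc matches 2 rows; 345xyz and xxdf456 each match 4 rows.
    println(joinOnRank(left, right).size)
  }
}
```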

With unique XML values:

val df = Seq(
    Data("123abc", 12345,"<XML><Date Depart=\"2019-06-30\" Arrive=\"2019-06-22\" /><Passengers><Passenger Age=\"ADT\" Quantity=\"1\" /><Passenger Age=\"CHD\" Quantity=\"1\" /></Passengers><Destination Code=\"LAX\"/></XML>"),
    Data("345xyz", 102030,"<XML><Date Depart=\"2019-07-30\" Arrive=\"2019-07-22\" /><Passengers><Passenger Age=\"BCD\" Quantity=\"2\" /><Passenger Age=\"APB\" Quantity=\"2\" /></Passengers><Destination Code=\"TX\"/></XML>"),
    Data("xxdf456", 201910,"<XML><Date Depart=\"2019-08-10\" Arrive=\"2019-08-22\" /><Passengers><Passenger Age=\"yyy\" Quantity=\"10\" /><Passenger Age=\"xxx\" Quantity=\"10\" /></Passengers><Destination Code=\"TX\"/></XML>")).toDF

Output:

+-------+------+----------+----------+-----------+----+---------+
|     id|  code|    Arrive|    Depart|Destination|_Age|_Quantity|
+-------+------+----------+----------+-----------+----+---------+
| 123abc| 12345|2019-06-22|2019-06-30|        LAX| CHD|        1|
| 123abc| 12345|2019-06-22|2019-06-30|        LAX| ADT|        1|
| 345xyz|102030|2019-07-22|2019-07-30|         TX| BCD|        2|
| 345xyz|102030|2019-07-22|2019-07-30|         TX| APB|        2|
|xxdf456|201910|2019-08-22|2019-08-10|         TX| yyy|       10|
|xxdf456|201910|2019-08-22|2019-08-10|         TX| xxx|       10|
+-------+------+----------+----------+-----------+----+---------+

Sorry for the long post.
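
The rank-based join only lines up when sorting the raw XML strings gives the same order as sorting the parsed Arrive, Depart and Destination values, and when no two records share the same XML. A possible alternative, not taken from the answer above, is to parse the XML column in place so that id and code never leave their row. The sketch below assumes a spark-xml release that provides `from_xml` and `schema_of_xml`; the object name `FromXmlSketch` and the helper `parseKeepingColumns` are hypothetical.

```scala
import com.databricks.spark.xml.functions.from_xml
import com.databricks.spark.xml.schema_of_xml
import org.apache.spark.sql.{DataFrame, Encoders, SparkSession}
import org.apache.spark.sql.functions.{col, explode}

object FromXmlSketch {
  // Parses the "xmldata" column in place, so id and code stay attached to each row.
  def parseKeepingColumns(df: DataFrame): DataFrame = {
    // Infer the XML schema from the strings in the column itself.
    val xmlSchema = schema_of_xml(df.select("xmldata").as[String](Encoders.STRING))

    df.withColumn("parsed", from_xml(col("xmldata"), xmlSchema))
      // Assumes Passenger was inferred as an array, i.e. some record has several passengers.
      .withColumn("Passenger", explode(col("parsed.Passengers.Passenger")))
      .select(
        col("id"),
        col("code"),
        col("parsed.Date._Arrive").as("Arrive"),
        col("parsed.Date._Depart").as("Depart"),
        col("parsed.Destination._Code").as("Destination"),
        col("Passenger._Age"),
        col("Passenger._Quantity"))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("from-xml-sketch").getOrCreate()
    import spark.implicits._

    val df = Seq(
      ("123abc", 12345, "<XML><Date Depart=\"2019-06-30\" Arrive=\"2019-06-22\" /><Passengers><Passenger Age=\"ADT\" Quantity=\"1\" /><Passenger Age=\"CHD\" Quantity=\"1\" /></Passengers><Destination Code=\"LAX\"/></XML>")
    ).toDF("id", "code", "xmldata")

    parseKeepingColumns(df).show()
    spark.stop()
  }
}
```

Because each XML value is parsed within its own row, duplicate XML strings in different records should not multiply the output the way the rank join does.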
