Parse XML string with spark xml in Scala
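I have a DataFrame with multiple columns, one of which contains XML. I need to parse the XML while keeping the other columns.
The code below parses the XML as required, but how do I add the other columns?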
import com.databricks.spark.xml.XmlReader
import org.apache.spark.sql.functions._
import spark.implicits._  // toDF, $"..." and the String encoder (`spark` is the spark-shell session)
case class Data(id: String, code: Int, xmldata: String)
val df = Seq(
  Data("123abc", 12345,"<XML><Date Depart=\"2019-06-30\" Arrive=\"2019-06-22\" /><Passengers><Passenger Age=\"ADT\" Quantity=\"1\" /><Passenger Age=\"CHD\" Quantity=\"1\" /></Passengers><Destination Code=\"LAX\"/></XML>")
).toDF
// Keep only the XML column as an RDD[String]; id and code are dropped here
val xrdd = df.select("xmldata").map(a => a.getString(0)).rdd
// Parse the XML strings and produce one row per Passenger element
val xmldf = new XmlReader().xmlRdd(spark.sqlContext, xrdd)
  .select(
    $"Date._Arrive".as("Arrive"),
    $"Date._Depart".as("Depart"),
    $"Destination._Code".as("Destination"),
    explode($"Passengers.Passenger").alias("Passenger")
  )
xmldf.select($"Arrive", $"Depart", $"Destination", $"Passenger._Age", $"Passenger._Quantity").show
This returns:
+----------+----------+-----------+----+---------+
|    Arrive|    Depart|Destination|_Age|_Quantity|
+----------+----------+-----------+----+---------+
|2019-06-22|2019-06-30|        LAX| ADT|        1|
|2019-06-22|2019-06-30|        LAX| CHD|        1|
+----------+----------+-----------+----+---------+
But what I want is the following (including id and code from the original DataFrame):
+------+-----+----------+----------+-----------+----+---------+
|    id| code|    Arrive|    Depart|Destination|_Age|_Quantity|
+------+-----+----------+----------+-----------+----+---------+
|123abc|12345|2019-06-22|2019-06-30|        LAX| ADT|        1|
|123abc|12345|2019-06-22|2019-06-30|        LAX| CHD|        1|
+------+-----+----------+----------+-----------+----+---------+
Could you try cross joining the df and xmldf DataFrames?
df.crossJoin(xmldf).select($"id",$"code",$"Arrive",$"Depart",$"Destination",$"Passenger._Age",$"Passenger._Quantity").show
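Note that a cross join pairs every row of df with every row of xmldf, so it only produces the desired table when df holds a single row. A minimal sketch of that behaviour, using plain Scala collections as stand-ins for the two DataFrames (the names `ids`, `parsed` and `crossed` are illustrative and not part of Spark or spark-xml):

```scala
// Plain Scala stand-ins for the two DataFrames (illustrative values only).
val ids = Seq("123abc", "345xyz")                 // one entry per row of df
val parsed = Seq(("LAX", "ADT"), ("LAX", "CHD"),  // passengers parsed from the first XML
                 ("TX", "BCD"), ("TX", "APB"))    // passengers parsed from the second XML

// A cross join is the Cartesian product: every id meets every parsed row.
val crossed = for (id <- ids; p <- parsed) yield (id, p._1, p._2)

crossed.foreach(println)
// 2 ids x 4 parsed rows = 8 rows, including wrong pairs such as ("123abc", "TX", "BCD")
```

With one input row the product happens to be correct; with more rows a real join key is needed.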
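Oops, sorry, I hadn't considered that. Could you try the following instead? The key point is that you need to pick a suitable key from the XML to join on.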
import org.apache.spark.sql.expressions.Window  // needed for Window.orderBy
case class Data(id: String, code: Int, xmldata: String)
val df = Seq(
  Data("123abc", 12345,"<XML><Date Depart=\"2019-06-30\" Arrive=\"2019-06-22\" /><Passengers><Passenger Age=\"ADT\" Quantity=\"1\" /><Passenger Age=\"CHD\" Quantity=\"1\" /></Passengers><Destination Code=\"LAX\"/></XML>"),
  Data("345xyz", 102030,"<XML><Date Depart=\"2019-07-30\" Arrive=\"2019-07-22\" /><Passengers><Passenger Age=\"BCD\" Quantity=\"2\" /><Passenger Age=\"APB\" Quantity=\"2\" /></Passengers><Destination Code=\"TX\"/></XML>"),
  Data("xxdf456", 201910,"<XML><Date Depart=\"2019-07-30\" Arrive=\"2019-07-22\" /><Passengers><Passenger Age=\"BCD\" Quantity=\"2\" /><Passenger Age=\"APB\" Quantity=\"2\" /></Passengers><Destination Code=\"TX\"/></XML>")
).toDF
val xrdd = df.select("xmldata").map(x => x.getString(0)).rdd
val xmldf = new XmlReader().xmlRdd(spark.sqlContext, xrdd)
  .select($"Date._Arrive".as("Arrive"), $"Date._Depart".as("Depart"), $"Destination._Code".as("Destination"), explode($"Passengers.Passenger").alias("Passenger"))
  .select($"Arrive", $"Depart", $"Destination", $"Passenger._Age", $"Passenger._Quantity")
val w = Window.orderBy("xmldata")                            // rank the source rows by their raw XML string
val dfx = df.withColumn("id1", dense_rank.over(w))
val w1 = Window.orderBy($"Arrive", $"Depart", $"Destination") // rank the parsed rows; assumes both orderings agree
val xmldfx = xmldf.withColumn("id1", dense_rank.over(w1))
dfx.alias("a").join(xmldfx.alias("b"), $"a.id1" === $"b.id1")
  .select($"a.id", $"a.code", $"b.Arrive", $"b.Depart", $"b.Destination", $"b._Age", $"b._Quantity").orderBy("id").show
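Here the XML in the second and third rows is identical, so both rows get the same dense_rank and the join matches each of them with the passengers of both. The output will be: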
+-------+------+----------+----------+-----------+----+---------+
|     id|  code|    Arrive|    Depart|Destination|_Age|_Quantity|
+-------+------+----------+----------+-----------+----+---------+
| 123abc| 12345|2019-06-22|2019-06-30|        LAX| ADT|        1|
| 123abc| 12345|2019-06-22|2019-06-30|        LAX| CHD|        1|
| 345xyz|102030|2019-07-22|2019-07-30|         TX| BCD|        2|
| 345xyz|102030|2019-07-22|2019-07-30|         TX| APB|        2|
| 345xyz|102030|2019-07-22|2019-07-30|         TX| APB|        2|
| 345xyz|102030|2019-07-22|2019-07-30|         TX| BCD|        2|
|xxdf456|201910|2019-07-22|2019-07-30|         TX| BCD|        2|
|xxdf456|201910|2019-07-22|2019-07-30|         TX| APB|        2|
|xxdf456|201910|2019-07-22|2019-07-30|         TX| BCD|        2|
|xxdf456|201910|2019-07-22|2019-07-30|         TX| APB|        2|
+-------+------+----------+----------+-----------+----+---------+
With a unique XML value in every row:
val df = Seq(
Data("123abc", 12345,"<XML><Date Depart=\"2019-06-30\" Arrive=\"2019-06-22\" /><Passengers><Passenger Age=\"ADT\" Quantity=\"1\" /><Passenger Age=\"CHD\" Quantity=\"1\" /></Passengers><Destination Code=\"LAX\"/></XML>"),Data("345xyz", 102030,"<XML><Date Depart=\"2019-07-30\" Arrive=\"2019-07-22\" /><Passengers><Passenger Age=\"BCD\" Quantity=\"2\" /><Passenger Age=\"APB\" Quantity=\"2\" /></Passengers><Destination Code=\"TX\"/></XML>"),Data("xxdf456", 201910,"<XML><Date Depart=\"2019-08-10\" Arrive=\"2019-08-22\" /><Passengers><Passenger Age=\"yyy\" Quantity=\"10\" /><Passenger Age=\"xxx\" Quantity=\"10\" /></Passengers><Destination Code=\"TX\"/></XML>")).toDF
Output:
+-------+------+----------+----------+-----------+----+---------+
|     id|  code|    Arrive|    Depart|Destination|_Age|_Quantity|
+-------+------+----------+----------+-----------+----+---------+
| 123abc| 12345|2019-06-22|2019-06-30|        LAX| CHD|        1|
| 123abc| 12345|2019-06-22|2019-06-30|        LAX| ADT|        1|
| 345xyz|102030|2019-07-22|2019-07-30|         TX| BCD|        2|
| 345xyz|102030|2019-07-22|2019-07-30|         TX| APB|        2|
|xxdf456|201910|2019-08-22|2019-08-10|         TX| yyy|       10|
|xxdf456|201910|2019-08-22|2019-08-10|         TX| xxx|       10|
+-------+------+----------+----------+-----------+----+---------+
Sorry for the long post.
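As an alternative to joining on a rank, newer spark-xml releases provide `from_xml` and `schema_of_xml`, which parse an XML string column in place so that id and code never leave the row. The following is only a sketch under that assumption (it has not been run against the asker's cluster or library version); the names `xmlSchema`, `parsed` and `result` are illustrative, and tuples are used instead of the `Data` case class to keep the snippet self-contained.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.explode
import com.databricks.spark.xml.functions.from_xml
import com.databricks.spark.xml.schema_of_xml

// In spark-shell a session already exists; getOrCreate() then returns it.
val spark = SparkSession.builder().master("local[*]").appName("from-xml-sketch").getOrCreate()
import spark.implicits._

// Two of the sample rows from the thread.
val df = Seq(
  ("123abc", 12345, "<XML><Date Depart=\"2019-06-30\" Arrive=\"2019-06-22\" /><Passengers><Passenger Age=\"ADT\" Quantity=\"1\" /><Passenger Age=\"CHD\" Quantity=\"1\" /></Passengers><Destination Code=\"LAX\"/></XML>"),
  ("345xyz", 102030, "<XML><Date Depart=\"2019-07-30\" Arrive=\"2019-07-22\" /><Passengers><Passenger Age=\"BCD\" Quantity=\"2\" /><Passenger Age=\"APB\" Quantity=\"2\" /></Passengers><Destination Code=\"TX\"/></XML>")
).toDF("id", "code", "xmldata")

// Infer the XML schema from the string column itself.
val xmlSchema = schema_of_xml(df.select("xmldata").as[String])

// Parse in place: id and code stay on the same row as the parsed struct,
// so no join (and no join key) is needed.
val result = df
  .withColumn("parsed", from_xml($"xmldata", xmlSchema))
  .select($"id", $"code",
    $"parsed.Date._Arrive".as("Arrive"),
    $"parsed.Date._Depart".as("Depart"),
    $"parsed.Destination._Code".as("Destination"),
    explode($"parsed.Passengers.Passenger").as("Passenger"))
  .select($"id", $"code", $"Arrive", $"Depart", $"Destination",
    $"Passenger._Age", $"Passenger._Quantity")

result.show()
```

Because each passenger row is derived from its own source row, identical XML in two rows no longer produces duplicated matches.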