
Spark-Shell Scala XML how to concatenate attributes

I am trying to concatenate XML attributes in Scala with a comma delimiter.

scala> val fileRead = sc.textFile("source_file")
fileRead: org.apache.spark.rdd.RDD[String] = source_file MapPartitionsRDD[8] at textFile at <console>:21

scala> val strLines = fileRead.map(x => x.toString)
strLines: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[9] at map at <console>:23

scala> val fltrLines = strLines.filter(_.contains("<record column1="))
fltrLines: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[10] at filter at <console>:25

scala> fltrLines.take(5)
res1: Array[String] = Array("<record column1="Hello" column2="there" column3="how" column4="are you?" />", "<record column1=...."

scala> val elem = fltrLines.map{ scala.xml.XML.loadString _ }
elem: org.apache.spark.rdd.RDD[scala.xml.Elem] = MapPartitionsRDD[34] at map at <console>:27

This is where I need to concatenate column1 with a comma, then column2, then a comma, then column3. In fact, I would also like to be able to change the order, e.g. column3, column1, column2, and so on.

scala> val attr = elem.map(_.attributes("column1"))
attr: org.apache.spark.rdd.RDD[Seq[scala.xml.Node]] = MapPartitionsRDD[35] at map at <console>:29

Right now it looks like this:

scala> attr.take(1)
res17: Array[String] = Array(Hello)

I need this:

scala> attr.take(1)
res17: Array[String] = Array(Hello, there, how, are you?)

Or, if I feel like it, this:

scala> attr.take(1)
res17: Array[String] = Array(are you?, there, Hello)

This will do what you want. You can get the list of attributes and sort it, but note that this only works if all of your XML records have the same column1, column2, ... attributes.

scala> elem.map { r =>
     |   // get all attribute keys (columnN) and sort them
     |   r.attributes.map(_.key).toSeq.sorted.
     |     // look up each value and convert it from Node to String
     |     map { k => r.attributes(k).toString } // add .toArray here if you want
     |                                           // an Array instead of a List
     | }.first()
res33: Seq[String] = List(Hello, there, how, are you?)
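The same sorted-attributes idea can be tried locally without Spark, as a quick sanity check. This is only a sketch: it assumes the scala-xml library is available (it is bundled with spark-shell) and uses the sample record from the question.

```scala
import scala.xml.XML

// Sample record from the question.
val record = """<record column1="Hello" column2="there" column3="how" column4="are you?" />"""
val elem   = XML.loadString(record)

// Sort the attribute keys (column1, column2, ...) alphabetically,
// then look up each value and render it as a plain String.
val values = elem.attributes.map(_.key).toSeq.sorted
  .map(k => elem.attributes(k).toString)

// Join the values with a comma delimiter.
val joined = values.mkString(", ")
println(joined)
```

Because the keys sort as column1, column2, column3, column4, the joined line comes out in the original column order; the sort only guarantees a stable order when the records share the same attribute names.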

This is how it worked for me. I set the lines up as scala.xml.Elem just as before:

scala> val fileRead = sc.textFile("source_file")
fileRead: org.apache.spark.rdd.RDD[String] = source_file MapPartitionsRDD[8] at textFile at <console>:21

scala> val strLines = fileRead.map(x => x.toString)
strLines: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[9] at map at <console>:23

scala> val fltrLines = strLines.filter(_.contains("<record column1="))
fltrLines: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[10] at filter at <console>:25

scala> fltrLines.take(5)
res1: Array[String] = Array("<record column1="Hello" column2="there" column3="how" column4="are you?" />", "<record column1=...."

scala> val elem = fltrLines.map{ scala.xml.XML.loadString _ }
elem: org.apache.spark.rdd.RDD[scala.xml.Elem] = MapPartitionsRDD[34] at map at <console>:27

But this time, instead of using the attributes("AttributeName") method, I used attributes.asAttrMap, which gives me the type Map[String,String] = Map(Key1 -> Value1, Key2 -> Value2, ....):

scala> val mappedElem = elem.map(_.attributes.asAttrMap)

Then I specified my own column order. This way, if a column does not exist, or the attribute's case differs in the XML, the data will simply show null. I can change null to anything I want:

val myVals = mappedElem.map { x => x.getOrElse("column3", null) + ", " + x.getOrElse("column1", null) }

So that is all I had to do to get the columns in an arbitrary order; when converting the XML file to a comma-separated file, I can change where each column ends up.

The output is:

how, Hello
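A slightly more general sketch of the same idea (not from the original answer): keep the desired column order in a Seq and build the line with mkString, so reordering columns means editing one Seq instead of the concatenation expression. The column names and the fallback value "null" below mirror the sample record and the getOrElse calls above.

```scala
import scala.xml.XML

// Sample record from the question.
val record = """<record column1="Hello" column2="there" column3="how" column4="are you?" />"""

// asAttrMap turns the attributes into Map(column1 -> Hello, column2 -> there, ...)
val attrs = XML.loadString(record).attributes.asAttrMap

// Desired output order -- edit this Seq to reorder the columns.
val order = Seq("column3", "column1")

// Missing columns fall back to "null", mirroring getOrElse above.
val line = order.map(c => attrs.getOrElse(c, "null")).mkString(", ")
println(line)
```

Applied across the RDD, this would be mappedElem.map(attrs => order.map(c => attrs.getOrElse(c, "null")).mkString(", ")), producing "how, Hello" for the sample record.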
