
Spark-Shell Scala XML: how to concatenate attributes

I am trying to concatenate the XML attributes in Scala with a comma separator.

scala> val fileRead = sc.textFile("source_file")
fileRead: org.apache.spark.rdd.RDD[String] = source_file MapPartitionsRDD[8] at textFile at <console>:21

scala> val strLines = fileRead.map(x => x.toString)
strLines: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[9] at map at <console>:23

scala> val fltrLines = strLines.filter(_.contains("<record column1="))
fltrLines: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[10] at filter at <console>:25

scala> fltrLines.take(5)
res1: Array[String] = Array("<record column1="Hello" column2="there" column3="how" column4="are you?" />", "<record column1=...."

scala> val elem = fltrLines.map{ scala.xml.XML.loadString _ }
elem: org.apache.spark.rdd.RDD[scala.xml.Elem] = MapPartitionsRDD[34] at map at <console>:27

This is where I need to do the concatenation of column1 with comma, then column 2, then comma, then column3... In fact, I want to be able to change the order like column3, column1, column2... as well.

scala> val attr = elem.map(_.attributes("column1"))
attr: org.apache.spark.rdd.RDD[Seq[scala.xml.Node]] = MapPartitionsRDD[35] at map at <console>:29

Here's what it looks like right now:

scala> attr.take(1)
res17: Array[Seq[scala.xml.Node]] = Array(Hello)

I need this:

scala> attr.take(1)
res17: Array[String] = Array(Hello, there, how, are you?)

Or this, if I feel like it:

scala> attr.take(1)
res17: Array[String] = Array(are you?, there, Hello)

This will do what you want. You can get the list of attributes and sort it, but note that this works only if all of your XML records have the same set of attributes (column1, column2, ...).

scala> elem.map { r =>
     // get all attribute keys (columnN) and sort them
     r.attributes.map(_.key).toSeq.sorted
       // look up each value and convert it from Node to String
       .map(k => r.attributes(k).toString) // add .toArray for an Array instead of a List
   }.first
res33: Seq[String] = List(Hello, there, how, are you?)
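If what you actually need is a single comma-separated String per record rather than a Seq, the same approach can finish with mkString (a minimal variant of the code above, not from the original answer):

val joined = elem.map { r =>
  r.attributes.map(_.key).toSeq.sorted
    .map(k => r.attributes(k).toString)
    .mkString(", ") // "Hello, there, how, are you?" for the sample record
}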

So here's how it worked for me. I set up my lines as scala.xml.Elem just like I did before:

scala> val fileRead = sc.textFile("source_file")
fileRead: org.apache.spark.rdd.RDD[String] = source_file MapPartitionsRDD[8] at textFile at <console>:21

scala> val strLines = fileRead.map(x => x.toString)
strLines: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[9] at map at <console>:23

scala> val fltrLines = strLines.filter(_.contains("<record column1="))
fltrLines: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[10] at filter at <console>:25

scala> fltrLines.take(5)
res1: Array[String] = Array("<record column1="Hello" column2="there" column3="how" column4="are you?" />", "<record column1=...."

scala> val elem = fltrLines.map{ scala.xml.XML.loadString _ }
elem: org.apache.spark.rdd.RDD[scala.xml.Elem] = MapPartitionsRDD[34] at map at <console>:27

But this time, instead of using the attributes("AttributeName") method, I used attributes.asAttrMap, which gave me a Map[String,String], i.e. Map(Key1 -> Value1, Key2 -> Value2, ...):

scala> val mappedElem = elem.map(_.attributes.asAttrMap)
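For the sample record above, the result would look roughly like this (illustrative output, not captured from the original session):

scala> mappedElem.first
res5: Map[String,String] = Map(column1 -> Hello, column2 -> there, column3 -> how, column4 -> are you?)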

Then I specified my own column order. That way, if a column (an attribute, in the XML case) doesn't exist, the data will just show null. I can change null to anything I want:

val myVals = mappedElem.map { x => x.getOrElse("column3", null) + ", " + x.getOrElse("column1", null) }

So that's what I had to do to get an arbitrary column order; you could call it changing column positions when transforming an XML file into a comma-delimited file.

The output was then:

how, Hello
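To generalize this, the column order can live in a list, and the result can be written straight out as a comma-delimited file. This is a sketch under my own assumptions (the order list, the "null" fallback, and the output path are all made up):

// pick any column order; missing attributes fall back to the string "null"
val order = Seq("column3", "column1", "column2")
val orderedVals = mappedElem.map(m => order.map(k => m.getOrElse(k, "null")).mkString(", "))
orderedVals.saveAsTextFile("output_dir") // one comma-delimited line per record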
