
How to read strings from a Scala spark-shell Array[String]

I have an XML file that I'm trying to process in spark-shell using Scala. I'm stuck at the point where I need to read each element of an Array[String]:

scala> val fileRead = sc.textFile("source_file")
fileRead: org.apache.spark.rdd.RDD[String] = source_file MapPartitionsRDD[8] at textFile at <console>:21

scala> val strLines = fileRead.map(x => x.toString)
strLines: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[9] at map at <console>:23

scala> val fltrLines = strLines.filter(_.contains("<record column1="))
fltrLines: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[10] at filter at <console>:25

scala> fltrLines.take(5)
res1: Array[String] = Array("<record column1="1" column2="1" column3="5" column4="2010-11-02T18:59:01.140" />", "<record column2=....

I need to read this value of the Array[String]:

"<record column1="1" column2="1" column3="5" column4="2010-11-02T18:59:01.140" />"

as XML so that I can use Scala Elem and NodeSeq classes to extract the data. So I want to do something like:

val xmlLines = fltrLines.....somehow get the value of the value of Array[String] first index

And then use xmlLines.attributes, etc.

You can do fltrLines.map { scala.xml.XML.loadString _ }, which builds an Elem out of each String. Check the docs; note, though, that this is an old Scaladoc from when the Scala standard library still contained XML. These days it lives in a separate jar (scala-xml), so if you are using a newer Scala version, make sure to put the right jar on your classpath.
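A minimal sketch of the full round trip, assuming a record line shaped like the ones in your fltrLines output (the sample string below is taken from the question; the val names are illustrative):

```scala
import scala.xml.{Elem, XML}

object ParseRecord {
  def main(args: Array[String]): Unit = {
    // One element of the Array[String] from fltrLines.take(5)
    val line =
      """<record column1="1" column2="1" column3="5" column4="2010-11-02T18:59:01.140" />"""

    // Parse the string into a scala.xml.Elem
    val elem: Elem = XML.loadString(line)

    // Attribute lookup with \ "@name" returns a NodeSeq; .text gives the value
    val column1 = (elem \ "@column1").text
    val column4 = (elem \ "@column4").text

    println(column1) // "1"
    println(column4) // "2010-11-02T18:59:01.140"
  }
}
```

In the RDD itself, the same extraction becomes something like fltrLines.map(XML.loadString).map(e => (e \ "@column1").text), keeping everything distributed rather than pulling the array to the driver first.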
