Regex on an io.Text RDD using Scala

Question

I have a problem. I need to extract some data from a file like this:

(3269,
<page>
<title>Anarchism</title>
<ns>0</ns>
<id>12</id>
<revision>...
)
(194712,
<page>
<title>AssistiveTechnology</title>
<ns>0</ns>
<id>23</id>.. 
) etc...

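This file was generated using:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

val conf = new Configuration
// Split records at the closing </page> tag instead of at line breaks
conf.set("textinputformat.record.delimiter", "</page>")
val rdd = sc.newAPIHadoopFile("sample.bz2", classOf[TextInputFormat], classOf[LongWritable], classOf[Text], conf)
rdd.map { case (k, v) => (k.get(), new String(v.copyBytes())) }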

I need to obtain the title content. I'm using a regex, but the output file is always empty. My code looks like this:

val xx = rdd.map(x => x._2).filter(x => x.matches(".*<title>([A-Za-z]+)<\\/title>.*"))

I also tried this pattern:

".*<title>([A-Za-z]+)</title>.*"

And this:

val reg = ".*<title>([\\w]+)</title>.*".r
val xx = rdd.map(x => x._2).filter(x => reg.pattern.matcher(x).matches)

I build the .jar with sbt and run it with spark-submit.

BTW, it works in spark-shell :S

I need your help, please. Thanks.

Answer 1

You could use the built-in Scala support for XML. Something like:

import scala.xml._
rdd.map(x => (XML.loadString(x._2) \\ "title").text)
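
If you would rather stay with a regex, here is a minimal sketch. It rests on one possible cause of the empty output, not a confirmed diagnosis: String.matches has to match the whole string, and by default "." does not match line breaks, so a pattern wrapped in ".*" fails on multi-line records like the ones in the question. The sketch assumes each record is a plain String; TitleExtract, TitlePattern and extractTitle are hypothetical names introduced only for illustration.

```scala
import scala.util.matching.Regex

object TitleExtract {
  // Unanchored pattern: findFirstMatchIn searches inside the record,
  // so no leading/trailing ".*" is needed and line breaks do not matter.
  val TitlePattern: Regex = "<title>([^<]+)</title>".r

  // Returns the title text if the record contains a <title> element.
  def extractTitle(record: String): Option[String] =
    TitlePattern.findFirstMatchIn(record).map(_.group(1))

  def main(args: Array[String]): Unit = {
    // A record shaped like the sample in the question (multi-line).
    val record = "\n<page>\n<title>Anarchism</title>\n<ns>0</ns>\n<id>12</id>\n"

    // Whole-string match fails: "." stops at line breaks.
    println(record.matches(".*<title>([A-Za-z]+)</title>.*"))     // false
    // With the DOTALL flag "(?s)", "." also matches line breaks.
    println(record.matches("(?s).*<title>([A-Za-z]+)</title>.*")) // true
    // Searching instead of matching avoids the problem altogether.
    println(extractTitle(record))                                  // Some(Anarchism)
  }
}
```

On the RDD this would be used as rdd.map(_._2).flatMap(x => TitleExtract.extractTitle(x)), assuming rdd holds (Long, String) pairs as produced by the map in the question. Searching for the tag also sidesteps the fact that records split on "</page>" no longer contain that closing tag, which a strict XML parser may reject.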
