
Scala regex in a paired RDD

I have a question about using regular expressions in RDD operations in Scala/Eclipse/Spark.

I have two data files which I have parsed and joined to form an RDD of pairs [URL, RegexOfURL]. The entries look something like this:

(http://coach.nationalexpress.com/nxbooking/journey-list,
(^https://www\.nationalexpress\.com/bps/confirmation\.cfm\?id=|^https://coach\.nationalexpress\.com/nxbooking/delivery-details))

I want to run an operation that matches each URL (the first part) against its regex (the second part). If the regex matches, flag the pair true; otherwise flag it false.

I have tried writing a function:

def operation(s1:RDD[String], s2:RDD[String]) = 
s1 match{
case s2 => 't'
case _ => 'f'
}

but this match does not do what I want. I want to apply the regex correctly and am having trouble.

I also tried breaking the RDD into individual lines and running a function on each, with no success. What would you suggest is the best way to do this?

Thanks in advance.

Given that the input data is an RDD of pairs (string, regex), where the regex is in String form, i.e. RDD[(String,String)], the transformation should look something like this:

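val urlMatchRegexRdd = urlRegexPairsRDD.map { case (url, regex) =>
  val pattern = regex.r // compile the string; a stable val can be used as an extractor
  url match {
    case pattern(_*) => ((url, regex), true)  // the regex must match the whole URL
    case _           => ((url, regex), false)
  }
}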

This results in an RDD of type RDD[((String, String),Boolean)], preserving the original information alongside the regex match result.
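One thing worth checking is how the match is anchored. Using a Regex as an extractor in a pattern match requires the regex to cover the entire input, while the sample regexes in the question start with ^ but have no trailing $, so they read like prefix patterns. The following is a minimal sketch, not a definitive implementation, that contrasts the two behaviours on a plain Seq so it runs without Spark. The names UrlRegexFlagDemo, matchesUnanchored and matchesWhole are hypothetical, and the second sample URL is invented for illustration.

```scala
import scala.util.matching.Regex

object UrlRegexFlagDemo {
  // Hypothetical helper: true if `regex` matches anywhere in `url`.
  // A leading ^ in the regex still pins the match to the start of the URL.
  def matchesUnanchored(url: String, regex: String): Boolean =
    regex.r.findFirstIn(url).isDefined

  // Same semantics as `case pattern(_*)`: the regex must cover the whole URL.
  def matchesWhole(url: String, regex: String): Boolean = {
    val pattern: Regex = regex.r
    url match {
      case pattern(_*) => true
      case _           => false
    }
  }

  // Regex taken from the question's sample pair.
  val regex: String =
    """^https://www\.nationalexpress\.com/bps/confirmation\.cfm\?id=|^https://coach\.nationalexpress\.com/nxbooking/delivery-details"""

  // First URL is from the question; the second is an assumed example.
  val pairs: Seq[(String, String)] = Seq(
    ("http://coach.nationalexpress.com/nxbooking/journey-list", regex),
    ("https://www.nationalexpress.com/bps/confirmation.cfm?id=123", regex)
  )

  // The same function body could be passed to RDD.map; a Seq keeps the sketch Spark-free.
  val flagged: Seq[((String, String), Boolean)] =
    pairs.map { case (url, re) => ((url, re), matchesUnanchored(url, re)) }

  def main(args: Array[String]): Unit =
    flagged.foreach { case ((url, _), flag) => println(s"$url -> $flag") }
}
```

If prefix matching is what is wanted, swapping the pattern match for findFirstIn (or calling .unanchored on the compiled Regex) may be the simpler choice; if the regexes are meant to describe complete URLs, the whole-string version is the stricter one.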
