Scala regex in a paired RDD

I have a question regarding regex in RDD operations in Scala/Eclipse/Spark.

I have two data files which I have parsed and joined together to form an RDD of pairs [URL, RegexOfURL]. They look something like this:

(http://coach.nationalexpress.com/nxbooking/journey-list,
(^https://www\.nationalexpress\.com/bps/confirmation\.cfm\?id=|^https://coach\.nationalexpress\.com/nxbooking/delivery-details))

I wish to run an operation in which each URL (the first part) is matched against its regex (the second part). If the regex matches, the pair should be flagged true; otherwise it should be flagged false.

I have tried writing a function:

def operation(s1:RDD[String], s2:RDD[String]) = 
s1 match{
case s2 => 't'
case _ => 'f'
}

but this match does not do what I want. I want to apply the regex properly, and I am having trouble with it.

I also tried breaking the RDD into individual lines and running a function on each, with no success. What would you suggest is the best way to do this?

Thanks in advance

Assuming the input data is an RDD of (url, regex) pairs in which the regex is held in String form, i.e. RDD[(String, String)], the transformation should look something like this:

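val urlMatchRegexRdd = urlRegexPairsRDD.map { case (url, regex) =>
  // Compile to a stable val: a pattern cannot call regex.r inline
  val pattern = regex.r
  url match {
    // Matches only if the whole url matches; use regex.r.unanchored for a partial match
    case pattern(_*) => ((url, regex), true)
    case _           => ((url, regex), false)
  }
}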

This will result in an RDD of the form RDD[((String, String),Boolean)] preserving the original information with the added regex match result.
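
As a rough sketch of the same idea that can be tried without a Spark cluster, the snippet below applies the check to a plain Seq of pairs. The names UrlRegexMatch, matchesRegex, pattern, pairs and flagged are hypothetical, and the second sample URL is made up for illustration. It also assumes that a partial match is what is wanted: the patterns in the question look like URL prefixes, so it uses findFirstIn rather than the Regex extractor, which requires the whole string to match.

```scala
object UrlRegexMatch {
  // Hypothetical helper: true when the regex matches somewhere in the URL.
  // The ^ anchors in the patterns still pin the match to the start of the string.
  def matchesRegex(url: String, regex: String): Boolean =
    regex.r.findFirstIn(url).isDefined

  // The regex from the question, written as a raw string so backslashes stay literal.
  val pattern: String =
    """^https://www\.nationalexpress\.com/bps/confirmation\.cfm\?id=|^https://coach\.nationalexpress\.com/nxbooking/delivery-details"""

  // Sample (url, regex) pairs; the second URL is an invented example.
  val pairs: Seq[(String, String)] = Seq(
    ("http://coach.nationalexpress.com/nxbooking/journey-list", pattern),
    ("https://coach.nationalexpress.com/nxbooking/delivery-details", pattern)
  )

  // Same shape as the RDD result: ((url, regex), matched).
  val flagged: Seq[((String, String), Boolean)] =
    pairs.map { case (url, regex) => ((url, regex), matchesRegex(url, regex)) }

  def main(args: Array[String]): Unit =
    flagged.foreach { case ((url, _), matched) => println(s"$url -> $matched") }
}
```

The first pair is flagged false because the URL starts with http:// while both alternatives require https://. The second is flagged true. Because matchesRegex only takes Strings, the same function body should work unchanged inside map on an RDD[(String, String)].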
