
Map a function to write to a global Spark RDD

I have an RDD of strings, where each line corresponds to a log entry.

I have a single function containing multiple regexes; it matches each line of the RDD and applies the appropriate regex.

I want to map this single function over my RDD so that it processes every line quickly and stores each processed line in another, global RDD.

The problem is that, since I want this task to be parallelized, my global RDD would have to be accessible concurrently so that every processed line can be added to it.

I was wondering whether there is another way to do this. I'm looking to improve my Spark skills.

For example, this is what I want to do:

I have a text file like:

ERROR : Hahhaha param_error=8 param_err2=https

WARNING : HUHUHUHUH param_warn=tchu param_warn2=wifi

My regex function will match the lines containing "ERROR" and produce an array, for example Array("Error", "8", "https").

Another regex function will match the lines containing "WARNING" and produce an array, for example Array("Warning", "tchu", "wifi").

In the end, I want to obtain an RDD[Array[String]] containing every processed line.

How do I keep this parallelized with Spark?

First, it's important to understand that there is no such thing as a "global RDD" in Spark, nor is there any reason you would need one. When using Spark, you should think in terms of transforming one RDD into another, not in terms of updating an RDD in place (which is impossible: RDDs are immutable). Each such transformation is executed in a distributed fashion (in parallel) by Spark.
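As a quick local analogy (plain Scala collections, no Spark needed), transformations return new values and never mutate their input; RDD operations like `map` and `filter` behave the same way, except Spark runs them in parallel across partitions:

```scala
// A local analogy: transformations return new collections,
// leaving the original untouched.
val lines  = Seq("ERROR : a", "WARNING : b")
val upper  = lines.map(_.toUpperCase)            // new Seq; `lines` is unchanged
val errors = lines.filter(_.startsWith("ERROR")) // another new Seq

// upper  == Seq("ERROR : A", "WARNING : B")
// errors == Seq("ERROR : a")
// lines  == Seq("ERROR : a", "WARNING : b")  -- still intact
```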

In this case, if I understand your requirement correctly, you'd want to map each record into one of the following results:

  • an Array[String] whose first item is "ERROR", or
  • an Array[String] whose first item is "WARNING", or
  • if no pattern matched the record, it should be removed

To do that, you can use the map(f) and collect(pf) methods of RDD (here, collect takes a partial function; it is not the zero-argument collect() that fetches results to the driver):

// Sample data:
val rdd = sc.parallelize(Seq(
  "ERROR : Hahhaha param_error=8 param_err2=https",
  "WARNING : HUHUHUHUH param_warn=tchu param_warn2=wifi",
  "Garbage - not matching anything"
))

// First we split on " : " to easily identify ERROR vs. WARNING
val splitPrefix = rdd.map(line => line.split(" : "))

// Implement these parsing functions as you see fit; 
// The input would be the part following the " : ", 
// and the output should be a list of the values (not including the ERROR / WARNING) 
def parseError(v: String): List[String] = ??? // example input: "Hahhaha param_error=8 param_err2=https"
def parseWarning(v: String): List[String] = ??? // example input: "HUHUHUHUH param_warn=tchu param_warn2=wifi"

// Now we can use these functions in a pattern-matching function passed to RDD.collect,
// which will transform each value that matches one of the cases, and will filter out 
// values that don't match anything
val result: RDD[List[String]] = splitPrefix.collect {
  case Array(l @ "ERROR", v) => l :: parseError(v)
  case Array(l @ "WARNING", v) => l :: parseWarning(v)
  // NOT adding a default case, so records that didn't match will be removed
}    

// If you really want Array[String] and not List[String]:    
val arraysRdd: RDD[Array[String]] = result.map(_.toArray)
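For completeness, here is one hypothetical way to fill in parseError and parseWarning, assuming the key=value layout shown in the sample lines (adjust to your real log format; the helper parseValues is my own name, not part of the answer above):

```scala
// Hypothetical implementation, assuming the "key=value" layout from the
// sample lines: both functions simply extract the value after each '=' sign.
def parseValues(v: String): List[String] =
  v.split("\\s+").toList.collect {
    case kv if kv.contains("=") => kv.split("=", 2)(1)
  }

def parseError(v: String): List[String]   = parseValues(v)
def parseWarning(v: String): List[String] = parseValues(v)

// parseError("Hahhaha param_error=8 param_err2=https") == List("8", "https")
// parseWarning("HUHUHUHUH param_warn=tchu param_warn2=wifi") == List("tchu", "wifi")
```

With these in place, the collect call above would yield List("ERROR", "8", "https") and List("WARNING", "tchu", "wifi") for the two sample lines, and drop the garbage line.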
