
Map function to write on global spark rdd

I have an RDD of strings. Each line corresponds to a different log entry.

I have several regexes in a single function that match/case the lines of the RDD to apply the appropriate regex.

I want to map this single function over my RDD so that it processes every line quickly and stores each processed line in another, global RDD.

The problem is that, since I want this task to be parallelized, my global RDD must be accessible concurrently so that every processed line can be added to it.

I was wondering if there is another way to do this. I'm looking to improve my Spark skills.

For example, this is what I want to do:

I have a txt file like:

ERROR : Hahhaha param_error=8 param_err2=https

WARNING : HUHUHUHUH param_warn=tchu param_warn2=wifi

My regex function will match lines containing "ERROR" and produce an array, for example Array("Error","8","https").

Another regex function will match lines containing "WARNING" and produce an array, for example Array("Warning","tchu","wifi").

In the end, I want to obtain an RDD[Array[String]] with one entry per processed line.

How do I keep this parallelized with Spark?

First, it's important to understand that there is no such thing as a "global RDD" in Spark, nor is there a reason you would need one. When using Spark, you should think in terms of transforming one RDD into another, not in terms of updating RDDs (which is impossible - RDDs are immutable). Each such transformation is executed by Spark in a distributed (parallel) fashion.
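
For example, rather than pushing results into some shared collection, each step simply derives a new RDD from the previous one (a minimal sketch; the file name is illustrative only):

val logs   = sc.textFile("logs.txt")            // RDD[String]
val parsed = logs.map(line => line.split(" "))  // a new RDD[Array[String]]; `logs` itself is unchanged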

In this case, if I understand your requirement correctly, you'd want to map each record into one of the following results:

  • an Array[String] where the first item is "ERROR", or:
  • an Array[String] where the first item is "WARNING", or:
  • if no pattern matched the record, remove it

To do that, you can use the map(f) and collect(f) methods of RDD:

// Assuming this runs in spark-shell, where `sc` (SparkContext) is predefined
import org.apache.spark.rdd.RDD

// Sample data:
val rdd = sc.parallelize(Seq(
  "ERROR : Hahhaha param_error=8 param_err2=https",
  "WARNING : HUHUHUHUH param_warn=tchu param_warn2=wifi",
  "Garbage - not matching anything"
))

// First we can split on " : " to easily identify ERROR vs. WARNING
val splitPrefix = rdd.map(line => line.split(" : "))

// Implement these parsing functions as you see fit; 
// The input would be the part following the " : ", 
// and the output should be a list of the values (not including the ERROR / WARNING) 
def parseError(v: String): List[String] = ??? // example input: "Hahhaha param_error=8 param_err2=https"
def parseWarning(v: String): List[String] = ??? // example input: "HUHUHUHUH param_warn=tchu param_warn2=wifi"

// Now we can use these functions in a pattern-matching function passed to RDD.collect,
// which will transform each value that matches one of the cases, and will filter out 
// values that don't match anything
val result: RDD[List[String]] = splitPrefix.collect {
  case Array(l @ "ERROR", v) => l :: parseError(v)
  case Array(l @ "WARNING", v) => l :: parseWarning(v)
  // NOT adding a default case, so records that didn't match will be removed
}    

// If you really want Array[String] and not List[String]:    
val arraysRdd: RDD[Array[String]] = result.map(_.toArray)
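
The parseError and parseWarning stubs are left for you to fill in. As a rough illustration, here is one minimal sketch, assuming the values you care about always appear as key=value pairs as in the sample lines above (the keyValue regex and the printed output are assumptions, not part of the code above); plug these in before building result:

// Hypothetical key=value extractor; adapt the pattern to your real log format
val keyValue = """\w+=(\S+)""".r

def parseError(v: String): List[String] =
  keyValue.findAllMatchIn(v).map(_.group(1)).toList   // "Hahhaha param_error=8 param_err2=https" -> List("8", "https")

def parseWarning(v: String): List[String] =
  keyValue.findAllMatchIn(v).map(_.group(1)).toList   // "HUHUHUHUH param_warn=tchu param_warn2=wifi" -> List("tchu", "wifi")

// Quick check on the driver:
result.collect().foreach(values => println(values.mkString(", ")))
// ERROR, 8, https
// WARNING, tchu, wifi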
