
Spark - How to handle error case in RDD.map() method correctly?

I am trying to do some text processing using Spark RDD.

The format of the input file is:

2015-05-20T18:30 <some_url>/?<key1>=<value1>&<key2>=<value2>&...&<keyn>=<valuen>

I want to extract some fields from the text and convert them into CSV format like:

<value1>,<value5>,<valuek>,<valuen>

The following code is how I do this:

val lines = sc.textFile(s"s3n://${MY_BUCKET}/${MY_FOLDER}/test/*.gz")
val records = lines.map { line =>
    val mp = line.split("&")
                 .map(_.split("="))
                 .filter(_.length >= 2)
                 .map(t => (t(0), t(1))).toMap

    (mp.get("key1"), mp.get("key5"), mp.get("keyk"), mp.get("keyn"))
}

I would like to know: if some line of the input text is malformed or invalid, the map() function cannot return a valid value. This should be very common in text processing, so what is the best practice to deal with this problem?

In order to manage these errors you can use Scala's Try class within a flatMap operation, as in the following code:

    import scala.util.{Try, Success}

    val lines = sc.textFile(s"s3n://${MY_BUCKET}/${MY_FOLDER}/test/*.gz")
    val records = lines.flatMap { line =>
      Try {
        // parse the query-string part of the line into a key/value map
        val mp = line.split("&")
                     .map(_.split("="))
                     .filter(_.length >= 2)
                     .map(t => (t(0), t(1))).toMap

        (mp.get("key1"), mp.get("key5"), mp.get("keyk"), mp.get("keyn"))
      } match {
        case Success(fields) => Seq(fields) // keep the parsed record
        case _               => Seq()       // drop lines that failed to parse
      }
    }
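
A slightly shorter sketch of the same idea (assuming the same parsing logic as above): Try(...).toOption turns a good line into Some(record) and a failing one into None, and flatMap flattens the Options away:

    import scala.util.Try

    // sketch: same parsing as above, but Try(...).toOption gives Some/None
    // and flatMap silently drops the None (i.e. the malformed) lines
    val records = lines.flatMap { line =>
      Try {
        val mp = line.split("&")
                     .map(_.split("="))
                     .filter(_.length >= 2)
                     .map(t => (t(0), t(1))).toMap
        (mp.get("key1"), mp.get("key5"), mp.get("keyk"), mp.get("keyn"))
      }.toOption
    }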

With this you keep only the "good ones", but if you want both (the errors and the good ones) I would recommend using a map function that returns a Scala Either and then using a Spark filter, as in the following code:

    import scala.util.{Try, Success, Failure}

    val lines = sc.textFile(s"s3n://${MY_BUCKET}/${MY_FOLDER}/test/*.gz")
    val goodBadRecords = lines.map { line =>
      Try {
        // same parsing logic as before
        val mp = line.split("&")
                     .map(_.split("="))
                     .filter(_.length >= 2)
                     .map(t => (t(0), t(1))).toMap

        (mp.get("key1"), mp.get("key5"), mp.get("keyk"), mp.get("keyn"))
      } match {
        case Success(fields) => Right(fields) // successfully parsed line
        case Failure(e)      => Left(e)       // keep the exception for inspection
      }
    }
    val records = goodBadRecords.filter(_.isRight)
    val errors = goodBadRecords.filter(_.isLeft)
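
Note that records and errors above still hold Either values. A minimal follow-up sketch of one way to unwrap them, using the partial-function overload of RDD.collect (not the collect() action that returns results to the driver); the val names are only illustrative:

    // keep only the successfully parsed tuples
    val parsedRecords = goodBadRecords.collect { case Right(fields) => fields }

    // keep only the error messages, e.g. to count or log them
    val errorMessages = goodBadRecords.collect { case Left(e) => e.getMessage }

    println(s"parsed: ${parsedRecords.count()}, failed: ${errorMessages.count()}")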

I hope this will be useful.
