简体   繁体   中英

Spark - How to handle error case in RDD.map() method correctly?

I am trying to do some text processing using Spark RDD.

The format of the input file is:

2015-05-20T18:30 <some_url>/?<key1>=<value1>&<key2>=<value2>&...&<keyn>=<valuen>

I want to extract some fields from the text and convert them into CSV format like:

<value1>,<value5>,<valuek>,<valuen>

The following code is how I do this:

val lines = sc.textFile(s"s3n://${MY_BUCKET}/${MY_FOLDER}/test/*.gz")
val records = lines.map { line =>
    val mp = line.split("&")
                 .map(_.split("="))
                 .filter(_.length >= 2)
                 .map(t => (t(0), t(1))).toMap

    (mp.get("key1"), mp.get("key5"), mp.get("keyk"), mp.get("keyn"))
}

I would like to know that, if some line of the input text is of wrong format or invalid, then the map() function cannot return a valid value. This should very common in text processing, what is the best practice to deal with this problem?

in order to manage this errors you can use the scala's class Try within a flatMap operation, in code:

    val lines = sc.textFile(s"s3n://${MY_BUCKET}/${MY_FOLDER}/test/*.gz")
    val records = lines.flatMap (line =>
        Try{
          val mp = line.split("&")
              .map(_.split("="))
              .filter(_.length >= 2)
              .map(t => (t(0), t(1))).toMap

          (mp.get("key1"), mp.get("key5"), mp.get("keyk"), mp.get("keyn"))
      } match {
        case Success(map) => Seq(map)
        case _ => Seq()
    })

With this you have only the "good ones" but if you want both (the errors and the good ones) i would recommend to use a map function that returns a Scala Either and then use a Spark filter, in code:

    val lines = sc.textFile(s"s3n://${MY_BUCKET}/${MY_FOLDER}/test/*.gz")
    val goodBadRecords = lines.map (line =>
        Try{
          val mp = line.split("&")
              .map(_.split("="))
              .filter(_.length >= 2)
              .map(t => (t(0), t(1))).toMap

          (mp.get("key1"), mp.get("key5"), mp.get("keyk"), mp.get("keyn"))
      } match {
        case Success(map) => Right(map)
        case Failure(e) => Left(e)
    })
    val records = goodBadRecords.filter(_.isRight)
    val errors = goodBadRecords.filter(_.isLeft)

I hope this will be useful

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM