
Scala structured data extraction via regex and pattern matching

Given structured data in a string format, how do I extract parts of the data effectively using pattern matching and regular expressions?

Example:

val input = Seq("name-12345","inval1d-12345","invalid-12here123","hello-54321","inval1d-1aa2")

case class Client(name:Option[String],clientID:Option[Int])

def parseClient(input:String):Option[Client] = {
  val clientRegex = """([a-zA-Z]+)-([0-9]+)""".r
  Option(input).flatMap(in => {
    in match {
      case clientRegex(name,clientID) => Some(Client(Some(name),Some(clientID.toInt)))
      case _ => None
    }
  })
}

input.map(parseClient)

The issue with this, however, is that if a single part of the structured data fails validation, I get None for the whole string and lose the parts that were valid.

How could I define the regular expressions in a hierarchical manner, such as:

val nameRegex = """([a-zA-Z]+)""".r
val clientIDRegex = """([0-9]+)""".r

Then match these combined within a pattern?

The output from the example:

Seq(
 Some(Client(Some("name"),Some(12345)))
 ,None
 ,None
 ,Some(Client(Some("hello"),Some(54321)))
 ,None
)

The required output:

Seq(
 Some(Client(Some("name"),Some(12345)))
 ,Some(Client(None,Some(12345)))
 ,Some(Client(Some("invalid"),None))
 ,Some(Client(Some("hello"),Some(54321)))
 ,None
)

This should give the expected outcome:

val input = Seq("name-12345", "inval1d-12345", "invalid-12here123", "hello-54321", "inval1d-1aa2")

case class Client(name: Option[String], clientID: Option[Int])

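def parseClient(input: String): Option[Client] = {
  // Each side of the hyphen either captures a valid value or falls back to a non-capturing alternative.
  val clientRegex = """(?:([a-zA-Z^-]+)|[^-]*)-(?:([0-9]+)|.*)""".r
  input match {
    // Neither part is valid: reject the whole input.
    case clientRegex(null, null) => None
    // A group that did not take part in the match is null, so Option(...) turns it into None.
    case clientRegex(name, id) => Some(Client(Option(name), Option(id).map(_.toInt)))
    // The input contains no hyphen at all.
    case _ => None
  }
}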

input.map(parseClient)

I removed the flatMap construct since it was unnecessary. The interesting part here is the regex:

"""(?:([a-zA-Z^-]+)|[^-]*)-(?:([0-9]+)|.*)"""

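I changed it so that each side expects either a correct value, which it captures in a group ( ([a-zA-Z^-]+) for the name and ([0-9]+) for the id ), or falls back to an alternative that matches anything else (no valid name or id). Each pair of alternatives is wrapped in a non-capturing group (?:...) so that the alternation is scoped correctly. Note that inside the character class [a-zA-Z^-] both ^ and the trailing - are literal characters, so the name group also accepts them; [a-zA-Z]+ is enough if only letters should count as a name.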

If something is not as expected in the capture groups, the group will be null, which is handled in the match-case.
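To see which groups come back as null, the pattern can be probed directly. This is a minimal sketch: the helper name groups is hypothetical and only wraps each captured group in Option so the nulls become visible as None.

```scala
// Same pattern as in the answer above.
val clientRegex = """(?:([a-zA-Z^-]+)|[^-]*)-(?:([0-9]+)|.*)""".r

// Hypothetical helper: returns both capture groups wrapped in Option,
// or None when the string does not match at all (no hyphen present).
def groups(s: String): Option[(Option[String], Option[String])] =
  s match {
    case clientRegex(name, id) => Some((Option(name), Option(id)))
    case _                     => None
  }

groups("name-12345")        // Some((Some(name),Some(12345)))
groups("inval1d-12345")     // Some((None,Some(12345)))
groups("invalid-12here123") // Some((Some(invalid),None))
groups("inval1d-1aa2")      // Some((None,None))
```

The last case is why parseClient needs the explicit clientRegex(null, null) branch: the string still matches the pattern, but neither group captured anything.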

EDIT: Made a correction to the code so that it works for completely invalid input, and removed unnecessary if-statements.

EDIT 2: Adapted the code according to the OP's comment, taking advantage of Option(null) evaluating to None.
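For the hierarchical definition the question asks about, another option is to keep nameRegex and clientIDRegex separate, split the input on the first hyphen, and match each part against its own regex. This is only a sketch under some assumptions: the helper field is hypothetical, toIntOption requires Scala 2.13 or later, and null input is not handled.

```scala
import scala.util.matching.Regex

case class Client(name: Option[String], clientID: Option[Int])

val nameRegex: Regex     = """([a-zA-Z]+)""".r
val clientIDRegex: Regex = """([0-9]+)""".r

// Hypothetical helper: Some(value) only if the whole part matches the regex.
def field(regex: Regex, part: String): Option[String] =
  part match {
    case regex(value) => Some(value)
    case _            => None
  }

def parseClient(input: String): Option[Client] =
  // Limit 2 splits on the first hyphen only and keeps the rest intact.
  input.split("-", 2) match {
    case Array(rawName, rawId) =>
      val name = field(nameRegex, rawName)
      // toIntOption (Scala 2.13+) also guards against values too large for Int.
      val id = field(clientIDRegex, rawId).flatMap(_.toIntOption)
      if (name.isEmpty && id.isEmpty) None else Some(Client(name, id))
    case _ => None // no hyphen in the input
  }
```

With the five inputs from the question, Seq(...).map(parseClient) gives the required output. Each field is validated on its own, so adding a third field means adding one regex and one call to field rather than extending a single large pattern.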

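You are probably looking for something like an applicative that you can chain. You can do something like this using the Validated type from cats (here parseClient is assumed to return a Validated, and isValid and ParseError are placeholders):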

val houseNumber = parseClient("house_number").andThen{ n =>
   if (isValid(n)) Validated.valid(n)
   else Validated.invalid(ParseError("house_number"))
}

I would also opt for using atto: it has the ParseResult type, which keeps all the information about parsing the string.
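The idea of keeping failure information per field can also be sketched with only the standard library, using Either in place of cats' Validated. This is a swapped-in technique, not the cats or atto API; ParseError, parseName, parseId and parseFields are hypothetical names, and toIntOption assumes Scala 2.13 or later.

```scala
// Hypothetical error type: which field failed and what the raw text was.
final case class ParseError(field: String, raw: String)

def parseName(raw: String): Either[ParseError, String] =
  if (raw.matches("[a-zA-Z]+")) Right(raw)
  else Left(ParseError("name", raw))

def parseId(raw: String): Either[ParseError, Int] =
  // toIntOption returns None for digit strings that overflow Int.
  if (raw.matches("[0-9]+")) raw.toIntOption.toRight(ParseError("clientID", raw))
  else Left(ParseError("clientID", raw))

// None when there is no hyphen; otherwise one result per field,
// so a failure in one field does not discard the other.
def parseFields(input: String): Option[(Either[ParseError, String], Either[ParseError, Int])] =
  input.split("-", 2) match {
    case Array(rawName, rawId) => Some((parseName(rawName), parseId(rawId)))
    case _                     => None
  }
```

Unlike Option, the Left side records why a field was rejected, which is the kind of information Validated or a ParseResult would carry.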
