Scala - String matches RegEx

This is on Scala 2.11.8

I'm trying to read and parse a text file in Scala, and I'm seeing behavior I didn't expect from String.matches.

Say I have a file.txt with the contents below:

#############
# HEADING 1
#############

- The zeroth line item, if there can be one
- First Line item
- Second Line item
- Here is the third
    and this one has some details
- A fourth one followed by empty line

- Fifth line item

I read the file and parse the contents like this:

val source = scala.io.Source.fromFile("file.txt")
val lines = try source.getLines.filterNot(_.matches("#.*")).mkString("\n") finally source.close
val items = lines.split("""(\n-|^-)\s""").filter(_.nonEmpty)

Inspecting individual items gives these results:

// print the first few items
scala> items(0)
res0: String = The zeroth line item, if there can be one

scala> items(1)
res1: String = First Line item

scala> items(3)
res2: String =
Here is the third
    and this one has some details

scala> items(4)
res3: String =
"A fourth one followed by empty line
"

scala> items(5)
res4: String =
"Fifth line item

"

Now for some matching

// Matching the items with RegEx
scala> items(0).matches("The.*")
res5: Boolean = true

scala> items(1).matches("First.*")
res6: Boolean = true

scala> items(3).matches("Here is.*")
res7: Boolean = false                    // ??

scala> items(4).matches("A fourth.*")
res8: Boolean = false                    // ??


// But startsWith seems to recognize it just fine!
scala> items(3).startsWith("Here is")
res9: Boolean = true

scala> items(4).startsWith("A fourth")
res10: Boolean = true

// Even this doesn't match
scala> items(4).matches(".*A fourth.*")
res11: Boolean = false                    // ?

My observation is that this happens only when the item is not a single line, i.e. when the item spans multiple lines (including when it is followed by an empty line, which leaves a trailing newline in the item).
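The behavior can be reproduced without the file. A minimal sketch, where the object name MatchesRepro and the three literal strings are hypothetical stand-ins for the parsed items:

```scala
object MatchesRepro {
  // Hypothetical stand-ins for items(1), items(3) and items(4).
  val singleLine = "First Line item"
  val multiLine  = "Here is the third\n    and this one has some details"
  val trailingNl = "A fourth one followed by empty line\n"

  def main(args: Array[String]): Unit = {
    println(singleLine.matches("First.*"))      // true
    println(multiLine.matches("Here is.*"))     // false: embedded newline
    println(trailingNl.matches("A fourth.*"))   // false: trailing newline
    println(trailingNl.startsWith("A fourth"))  // true
  }
}
```

Only the strings that contain a newline fail to match, which is consistent with the observation above.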

Is this behavior expected? How can I match consistently using a regex?

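String.matches succeeds only if the entire string matches the pattern, and by default . does not match line terminators. So .* stops at the first newline, and the match fails for any item that contains one. Consider activating DOTALL mode with the (?s) flag at the beginning of the regex, which makes . match line terminators as well. Example: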

val text = 
  """|- The zeroth line item, if there can be one
     |- First Line item
     |- Second Line item
     |- Here is the third
     |    and this one has some details
     |- A fourth one followed by empty line
     |
     |- Fifth line item
     |
     |""".stripMargin


val items = text.split("""(\n-|^-)\s""").filter(_.nonEmpty)

def describeMatch(str: String, regex: String): Unit = {
  println("-" * 60)
  println("The string\n>>>%s<<<\n%s".format(
    str,
    (if (str.matches(regex)) "Matches" else "Doesn't match") + s" >>>$regex<<<"
  ))
}

describeMatch(items(0), "The.*")
describeMatch(items(1), "First.*")
describeMatch(items(3), "Here is.*")
describeMatch(items(3), "(?s)Here is.*")
describeMatch(items(4), "A fourth.*")
describeMatch(items(4), "(?s)A fourth.*")
describeMatch(items(4), ".*A fourth.*$")
describeMatch(items(4), "(?s)^A fourth.*$")

The output should speak for itself:

------------------------------------------------------------
The string
>>>The zeroth line item, if there can be one<<<
Matches >>>The.*<<<
------------------------------------------------------------
The string
>>>First Line item<<<
Matches >>>First.*<<<
------------------------------------------------------------
The string
>>>Here is the third
    and this one has some details<<<
Doesn't match >>>Here is.*<<<
------------------------------------------------------------
The string
>>>Here is the third
    and this one has some details<<<
Matches >>>(?s)Here is.*<<<
------------------------------------------------------------
The string
>>>A fourth one followed by empty line
<<<
Doesn't match >>>A fourth.*<<<
------------------------------------------------------------
The string
>>>A fourth one followed by empty line
<<<
Matches >>>(?s)A fourth.*<<<
------------------------------------------------------------
The string
>>>A fourth one followed by empty line
<<<
Doesn't match >>>.*A fourth.*$<<<
------------------------------------------------------------
The string
>>>A fourth one followed by empty line
<<<
Matches >>>(?s)^A fourth.*$<<<
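The inline (?s) flag is not the only option. Below is a minimal sketch of three alternatives: compiling the pattern with Pattern.DOTALL, using a character class such as [\s\S] that matches any character including newlines, and using Matcher.lookingAt when only the beginning of the string needs to match. The object and helper names (DotallAlternatives, matchesDotall, startsWithPattern) are hypothetical and chosen for illustration; only the java.util.regex calls are standard API.

```scala
import java.util.regex.Pattern

object DotallAlternatives {
  // Same effect as prefixing the regex with (?s): '.' also matches line terminators.
  def matchesDotall(str: String, regex: String): Boolean =
    Pattern.compile(regex, Pattern.DOTALL).matcher(str).matches()

  // lookingAt anchors the match at the start but does not require
  // the pattern to consume the whole string.
  def startsWithPattern(str: String, regex: String): Boolean =
    Pattern.compile(regex).matcher(str).lookingAt()

  def main(args: Array[String]): Unit = {
    // Hypothetical stand-in for items(4).
    val item = "A fourth one followed by empty line\n"

    println(item.matches("A fourth.*"))             // false
    println(matchesDotall(item, "A fourth.*"))      // true
    println(item.matches("A fourth[\\s\\S]*"))      // true
    println(startsWithPattern(item, "A fourth.*"))  // true
  }
}
```

Which variant fits best depends on the intent: DOTALL (inline or as a flag) if the pattern should describe the whole multi-line item, lookingAt if only a prefix check is needed.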
