
How to get Scala string split to match Python

I am using spark-shell and pyspark to do a word count on one article. Scala's flatMap over line.split(" ") and Python's split() give different word counts (the Scala count is higher). I tried split(" +") and split("\\W+") in the Scala code, but I cannot get the count down to match the Python one.

Does anyone know what pattern would match Python exactly?

Python's str.split() has some special behaviour when no separator is given:

runs of consecutive whitespace are regarded as a single separator, and the result will contain no empty strings at the start or end if the string has leading or trailing whitespace. Consequently, splitting an empty string or a string consisting of just whitespace with a None separator returns [].

For example, ' 1 2 3 '.split() returns ['1', '2', '3']

The easiest way to fully match this in Scala is probably like this:

scala> """\S+""".r.findAllIn(" 1  2   3  ").toList
res0: List[String] = List(1, 2, 3)

scala> """\S+""".r.findAllIn("   ").toList
res1: List[String] = List()

scala> """\S+""".r.findAllIn("").toList
res2: List[String] = List()
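
For the word count in the question, the same pattern can be wrapped in a small helper and passed to flatMap. This is only a sketch: the names PySplit and pySplit are made up for illustration, and a plain List stands in for the lines that Spark would supply. One caveat: by default Java's \s and \S only recognise ASCII whitespace, while Python 3's split() also treats Unicode whitespace such as U+00A0 as a separator, so text containing such characters may still be counted differently.

```scala
// Hypothetical helper (not part of the original answer): mimics Python's
// str.split() with no arguments, assuming only ASCII whitespace occurs.
object PySplit {
  private val NonSpace = """\S+""".r

  // Collect every maximal run of non-whitespace characters.
  def pySplit(s: String): List[String] = NonSpace.findAllIn(s).toList

  def main(args: Array[String]): Unit = {
    // A plain List stands in for the lines of the article.
    val lines = List(" 1  2   3  ", "   ", "", "a\tb\nc")
    // flatMap mirrors the word-count step described in the question.
    val words = lines.flatMap(pySplit)
    println(words)      // List(1, 2, 3, a, b, c)
    println(words.size) // 6
  }
}
```

In spark-shell the helper could presumably be used the same way, e.g. as the function given to flatMap on the lines of the article.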

Another way is to trim() the string beforehand:

scala> " 1  2   3  ".trim().split("""\s+""")
res3: Array[String] = Array(1, 2, 3)

But that doesn't have the same behaviour as Python for empty strings:

scala> "".trim().split("""\s+""")
res4: Array[String] = Array("")

In Scala, split() on an empty string returns an array with one element (the empty string itself), whereas in Python the result is a list with zero elements.
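
The extra words in the Scala count most likely come from the same source: Java's split() keeps empty strings at the start and between adjacent separators, and only drops them at the end. A rough sketch of the difference, and of one way to close the gap by filtering out empty tokens after splitting (the names SplitCompare and splitNoEmpties are invented for this example):

```scala
object SplitCompare {
  // Hypothetical helper: split on whitespace, then drop empty tokens.
  // With the filter in place, trim() is no longer needed.
  def splitNoEmpties(s: String): Array[String] =
    s.split("""\s+""").filter(_.nonEmpty)

  def main(args: Array[String]): Unit = {
    val line = " a  b "
    println(line.split(" ").length)      // 4: "", "a", "", "b"
    println(line.split(" +").length)     // 3: "", "a", "b"
    println(splitNoEmpties(line).length) // 2: "a", "b"
    println(splitNoEmpties("").length)   // 0, like Python's "".split()
  }
}
```

This also shows why split(" +") and split("\\W+") did not bring the count down fully: a line that starts with a separator still yields a leading empty string.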
