How to get Scala string split to match Python
I am using spark-shell and pyspark to do a word count on one article. A Scala flatMap on line.split(" ") and Python's split() give different word counts (the Scala count is higher). I tried split(" +") and split("\\W+") in the Scala code, but could not get the count down to match the Python one.
Does anyone know what pattern would match Python exactly?
Python's str.split() has some special behaviour for the default separator:
runs of consecutive whitespace are regarded as a single separator, and the result will contain no empty strings at the start or end if the string has leading or trailing whitespace. Consequently, splitting an empty string or a string consisting of just whitespace with a None separator returns [].
For example, ' 1 2 3 '.split() returns ['1', '2', '3'].
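To see where the extra Scala tokens likely come from, here is a minimal sketch (the object name `SplitDemo` and the sample line are made up for illustration). It assumes the article contains leading whitespace or runs of spaces, in which case Scala's `split` keeps empty strings that Python's no-argument `split()` never produces:

```scala
object SplitDemo {
  // Sample line with a leading space, a double space, and a trailing space.
  val line = " 1  2 3 "

  // Single-space separator: the leading space and the double space each
  // produce an empty token (trailing empty tokens are dropped by split).
  val naive: List[String] = line.split(" ").toList        // List("", "1", "", "2", "3")

  // Collapsing runs of spaces still leaves the leading empty token.
  val collapsed: List[String] = line.split(" +").toList   // List("", "1", "2", "3")

  // Dropping empty tokens gives the same words Python's split() would.
  val words: List[String] = naive.filter(_.nonEmpty)      // List("1", "2", "3")
}
```

Each of those empty strings is counted as a "word" by a naive word count, which would explain a higher total on the Scala side.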
The easiest way to fully match this in Scala is probably like this:
scala> """\S+""".r.findAllIn(" 1 2 3 ").toList
res0: List[String] = List(1, 2, 3)
scala> """\S+""".r.findAllIn(" ").toList
res1: List[String] = List()
scala> """\S+""".r.findAllIn("").toList
res2: List[String] = List()
Another way is to trim() the string beforehand:
scala> " 1 2 3 ".trim().split("""\s+""")
res3: Array[String] = Array(1, 2, 3)
But that doesn't have the same behaviour as Python for empty strings:
scala> "".trim().split("""\s+""")
res4: Array[String] = Array("")
In Scala, split() of an empty string returns an array with one element (the empty string itself), but in Python the result is a list with zero elements.
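If the split-based form is preferred, the empty-string case can be handled by filtering out empty tokens. The helper below is a sketch only: `PySplit` and `pySplit` are hypothetical names, not part of any library, and it assumes the text uses ASCII whitespace (Java's `\s` does not cover every character that Python's `split()` treats as whitespace, so results may differ on text with unusual Unicode spaces).

```scala
object PySplit {
  // Hypothetical helper: approximates Python's no-argument str.split()
  // for ASCII whitespace. Splitting on \s+ may leave one empty token
  // (leading whitespace or an empty input); filtering removes it.
  def pySplit(s: String): List[String] =
    s.split("""\s+""").filter(_.nonEmpty).toList
}
```

With this, `PySplit.pySplit(" 1 2 3 ")` yields `List(1, 2, 3)`, while both `""` and a whitespace-only string yield an empty list, as in Python. In spark-shell, such a function could presumably be passed to `flatMap` in place of `line.split(" ")`.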