How to make Scala string split match Python

Question

I am using spark-shell and pyspark to do a word count on one article. A Scala flatMap over line.split(" ") and Python's split() give different word counts (the Scala count is higher). I tried split(" +") and split("\\W+") in the Scala code, but cannot get the count down to match the Python one.

Does anyone know what pattern would match Python exactly?

Answer 1

Python's str.split() has some special behaviour for the default separator:

runs of consecutive whitespace are regarded as a single separator, and the result will contain no empty strings at the start or end if the string has leading or trailing whitespace. Consequently, splitting an empty string or a string consisting of just whitespace with a None separator returns [].

For example, ' 1 2 3 '.split() returns ['1', '2', '3'].

The easiest way to fully match Python's behaviour in Scala is probably this:
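To see why the Scala count comes out higher, it helps to look at what the attempts from the question return on a small string. The sketch below is only an illustration; the object name NaiveSplitDemo and the sample strings are made up for this example.

```scala
object NaiveSplitDemo {
  val sample = " 1  2   3  "

  // Literal single-space separator: every extra space yields an empty token.
  // Trailing empty tokens are dropped, leading and interior ones are kept.
  val bySpace: List[String] = sample.split(" ").toList
  // List("", "1", "", "2", "", "", "3")

  // Regex separators collapse runs of spaces, but the match at the very
  // start of the string still produces one leading empty token.
  val bySpaces: List[String] = sample.split(" +").toList
  // List("", "1", "2", "3")

  val byNonWord: List[String] = sample.split("\\W+").toList
  // List("", "1", "2", "3")

  // \W+ also splits on punctuation, which Python's split() does not.
  val apostrophe: List[String] = "don't stop".split("\\W+").toList
  // List("don", "t", "stop")

  def main(args: Array[String]): Unit = {
    println(bySpace.size)   // 7
    println(bySpaces.size)  // 4
    println(byNonWord.size) // 4
    println(apostrophe)
  }
}
```

On this input split(" ") gives seven tokens and the two regex variants give four, against three from Python's split(). With "\\W+" a word such as "don't" additionally counts as two tokens.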

scala> """\S+""".r.findAllIn(" 1  2   3  ").toList
res0: List[String] = List(1, 2, 3)

scala> """\S+""".r.findAllIn("   ").toList
res1: List[String] = List()

scala> """\S+""".r.findAllIn("").toList
res2: List[String] = List()
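For a word count, the findAllIn approach can be wrapped in a small helper. This is a minimal sketch on plain Scala collections rather than Spark; the names PySplit, pySplit and wordCount are hypothetical and not part of any library.

```scala
object PySplit {
  // Matches maximal runs of non-whitespace characters.
  private val Token = """\S+""".r

  /** Tokens of `line`, in the manner of Python's no-argument str.split(). */
  def pySplit(line: String): List[String] =
    Token.findAllIn(line).toList

  /** Word counts over a collection of lines (plain collections, no Spark). */
  def wordCount(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(pySplit)
      .groupBy(identity)
      .map { case (word, occurrences) => word -> occurrences.size }

  def main(args: Array[String]): Unit = {
    println(pySplit(" 1  2   3  "))         // List(1, 2, 3)
    println(pySplit("   "))                 // List()
    println(wordCount(Seq(" a b ", "b  c")))
  }
}
```

The same function could presumably be passed to flatMap on an RDD of lines in spark-shell. One caveat: Java's \s covers only ASCII whitespace by default, while Python 3's str.split() also splits on Unicode whitespace such as a non-breaking space, so counts may still differ on text that contains such characters.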

Another way is to trim() the string beforehand:

scala> " 1  2   3  ".trim().split("""\s+""")
res3: Array[String] = Array(1, 2, 3)

But that doesn't have the same behaviour as Python for empty strings:

scala> "".trim().split("""\s+""")
res4: Array[String] = Array("")

In Scala, split() on an empty string returns an array with one element (the empty string), but in Python the result is a list with zero elements.
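If the trim() variant is preferred, one possible way to close that gap is to drop empty tokens after splitting. This is a sketch, not part of the original answer; TrimSplit and trimSplit are hypothetical names.

```scala
object TrimSplit {
  /** trim + split on whitespace runs, with the stray empty token removed. */
  def trimSplit(s: String): List[String] =
    s.trim.split("""\s+""").filter(_.nonEmpty).toList

  def main(args: Array[String]): Unit = {
    println(trimSplit(" 1  2   3  ")) // List(1, 2, 3)
    println(trimSplit(""))            // List()
    println(trimSplit("   "))         // List()
  }
}
```

With the filter in place, both the empty string and a whitespace-only string give an empty list, as in Python.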

Answer 1
0 2015-05-02 23:24:39
