
How to get Scala string split to match Python

I am using spark-shell and pyspark to do a word count on one article. A Scala flatMap over line.split(" ") and Python's split() give different word counts (the Scala count is higher). I tried split(" +") and split("\\W+") in the Scala code, but cannot get the count down to match the Python one.

Does anyone know what pattern would match Python exactly?

Python's str.split() has some special behaviour for the default separator:

runs of consecutive whitespace are regarded as a single separator, and the result will contain no empty strings at the start or end if the string has leading or trailing whitespace. Consequently, splitting an empty string or a string consisting of just whitespace with a None separator returns [].

For example, ' 1 2 3 '.split() returns ['1', '2', '3'].

The easiest way to fully match this in Scala is probably to collect the runs of non-whitespace characters instead of splitting on the whitespace:

scala> """\S+""".r.findAllIn(" 1  2   3  ").toList
res0: List[String] = List(1, 2, 3)

scala> """\S+""".r.findAllIn("   ").toList
res1: List[String] = List()

scala> """\S+""".r.findAllIn("").toList
res2: List[String] = List()
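
To tie this back to the word count in the question, here is a minimal sketch on plain Scala collections (no Spark needed). The helper name pySplit and the sample lines are made up for illustration; in spark-shell the same function could be passed to flatMap on an RDD of lines.

```scala
// Hypothetical helper (not a library function): tokenize a line by
// collecting every run of non-whitespace characters.
def pySplit(line: String): List[String] =
  """\S+""".r.findAllIn(line).toList

// Made-up sample input with leading, trailing and repeated spaces.
val lines = List(" 1  2   3  ", "", "   ", "a b")

// What the question used: every extra space yields an empty "word".
val naive = lines.flatMap(_.split(" ").toList)

// Python-like tokenization: no empty strings in the result.
val pythonLike = lines.flatMap(pySplit)
```

Here naive ends up with more elements than pythonLike because split(" ") keeps empty strings between consecutive spaces and at the start of a line, which is the likely source of the higher Scala count.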

Another way is to trim() the string beforehand:

scala> " 1  2   3  ".trim().split("""\s+""")
res3: Array[String] = Array(1, 2, 3)

But that doesn't behave the same way as Python for empty strings:

scala> "".trim().split("""\s+""")
res4: Array[String] = Array("")

In Scala, split() on an empty string returns an array with one element (the empty string itself), but in Python the result is a list with zero elements.
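
If the split-based form is preferred, one possible workaround is to drop the empty strings after splitting instead of trimming first. This is only a sketch, and the name splitLikePython is hypothetical:

```scala
// Hypothetical helper: split on runs of whitespace, then discard empty
// strings. This covers both the leading empty element produced by leading
// whitespace and the single "" returned when the input is empty.
def splitLikePython(s: String): Array[String] =
  s.split("""\s+""").filter(_.nonEmpty)
```

With this, an empty string and a whitespace-only string both give an empty array, in line with the Python behaviour described above.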
