
The behavior of string.split for short strings when maxsplit is supplied

I recently ran across some intriguing behavior of the string.split method in Python 2.7, particularly with respect to short strings (less than around 25 characters, see below), that involves contrasting the behavior of:

# Without maxsplit
my_string.split('A')

to

# With maxsplit=1
my_string.split('A', 1)

The second method is actually slower for short strings, and I'm quite curious as to why.

The Test

This first came about from a small call to timeit that my co-worker discovered:

# Without maxsplit
$ python -m timeit -s "json_line='a|b|c'" "part_one='|'.split(json_line)[0]"
1000000 loops, best of 3: 0.274 usec per loop
# With maxsplit
$ python -m timeit -s "json_line='a|b|c'" "part_one='|'.split(json_line,1)[0]"
1000000 loops, best of 3: 0.461 usec per loop

I thought this was certainly curious, so I put together a more detailed test. First I wrote the following small function that generates random strings of a specified length consisting of the first ten capital letters:

from random import choice

# 'A' through 'J'
choices = map(chr, range(65, 75))

def make_random_string(length):
    return ''.join(choice(choices) for i in xrange(length))

Then I wrote a couple of tester functions to repeatedly split and time randomly generated strings of a specified length:

from timeit import timeit

def time_split_of_size(str_length, n_strs_to_split):
    times = []
    data = [make_random_string(str_length) for i in xrange(n_strs_to_split)]
    for s in data:
        t = timeit("'{s}'.split('A')".format(s=s),
                   setup="from __main__ import make_random_string",
                   number=1000)
        times.append(t)
    return times

def time_split_of_size_with_maxcount(str_length, n_strs_to_split):
    times = []
    data = [make_random_string(str_length) for i in xrange(n_strs_to_split)]
    for s in data:
        t = timeit("'{s}'.split('A', 1)".format(s=s),
                   setup="from __main__ import make_random_string",
                   number=1000)
        times.append(t)
    return times

I then ran these testing methods over strings of varying sizes:

from collections import OrderedDict
from numpy import mean  # assumed: mean() was not defined in the original snippet
d = OrderedDict({})
for str_length in xrange(10, 10*1000, 25):
    no_maxcount = mean(time_split_of_size(str_length, 20))
    with_maxcount = mean(time_split_of_size_with_maxcount(str_length, 20))
    d[str_length] = [no_maxcount, with_maxcount]

This gives you the behavior you would expect: O(1) for the method with maxsplit=1 and O(n) for splitting all the way. Here's a plot of the time by the length of the string; the barely visible green curve is with maxsplit=1 and the blue curve is without:

[Figure: StringSplitTiming, time to split vs. string length (green: maxsplit=1, blue: no maxsplit)]
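
A minimal sketch along these lines (assuming matplotlib is available) would regenerate a plot like the one above from d:

import matplotlib.pyplot as plt

lengths = list(d.keys())
no_maxcount_times = [v[0] for v in d.values()]
with_maxcount_times = [v[1] for v in d.values()]

plt.plot(lengths, no_maxcount_times, 'b', label="split('A')")
plt.plot(lengths, with_maxcount_times, 'g', label="split('A', 1)")
plt.xlabel('string length (characters)')
plt.ylabel('mean time for 1000 splits (s)')
plt.legend()
plt.show()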

Nonetheless, the behavior my co-worker discovered for small strings is real. Here's some code that times many splits of short strings:

from collections import OrderedDict
d = OrderedDict({})
for str_length in xrange(1, 50, 2):
    no_maxcount = mean(time_split_of_size(str_length, 500))
    with_maxcount = mean(time_split_of_size_with_maxcount(str_length, 500))
    d[str_length] = [no_maxcount, with_maxcount]

With the following results:

[Figure: StringSplitShortString, split timings for short strings]

It seems like there is some overhead for strings less than 25 or so characters in length. The shape of the green curve is also quite curious: it increases parallel to the blue curve before leveling off permanently.

I took a look at the source code, which you may find here:

stringobject.c (line 1449) and stringlib/split.h (line 105)

but nothing obvious jumped out at me.

Any idea what is causing the overhead when maxsplit is passed for the short strings?

The difference actually has nothing to do with what's going on inside string_split. In fact, the time spent inside that function is always slightly longer for the default split than for maxsplit=1, even if there are no splits to be done. And it's not the PyArg_ParseTuple difference (the best report I can get without instrumenting the interpreter says it takes 0ns either way, so whatever difference there is, it's not going to matter).

The difference is that it takes an extra bytecode to pass an extra parameter.

As Stefan Pochmann suggested, you can tell this by testing with an explicit maxsplit=-1:

In [212]: %timeit ''.split('|')
1000000 loops, best of 3: 267 ns per loop
In [213]: %timeit ''.split('|', 1)
1000000 loops, best of 3: 303 ns per loop
In [214]: %timeit ''.split('|', -1)
1000000 loops, best of 3: 307 ns per loop
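
The same three-way comparison can be reproduced without IPython; a rough sketch with the timeit module, where the exact numbers will of course vary by machine:

from timeit import timeit

# timeit returns total seconds for 1,000,000 calls, so t * 1000 is ns per call
for stmt in ("''.split('|')", "''.split('|', 1)", "''.split('|', -1)"):
    t = timeit(stmt, number=1000000)
    print('%-18s %4.0f ns per call' % (stmt, t * 1000))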

So even in this minimal example, the -1 is slightly slower than the 1. But we're talking about 4ns of extra work. (I'm pretty sure this 4ns is because of preallocating a list of size 12 instead of size 2, but I don't want to run through a profiler just to make sure.)

Meanwhile, a NOP bytecode takes 32ns to evaluate on my system (from another answer I'm still trying to find…). I can't imagine that LOAD_CONST is faster than NOP.

So, until you're doing enough work to overwhelm that 32ns+, not passing a maxsplit argument will save you time.
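
As a rough illustration of that crossover, a sketch along these lines (the test strings and repeat counts are arbitrary) shows maxsplit=1 pulling ahead once there is enough actual splitting work:

from timeit import timeit

# More separators means more work for a full split; maxsplit=1 should win
# once that work outweighs the cost of the extra LOAD_CONST.
for n_seps in (1, 5, 25, 100):
    s = 'xA' * n_seps
    full = timeit(lambda: s.split('A'), number=100000)
    first = timeit(lambda: s.split('A', 1), number=100000)
    print('%3d separators: full split %.4fs, maxsplit=1 %.4fs' % (n_seps, full, first))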

In case it isn't obvious, here's the disassembly for the two cases:

  1           0 LOAD_CONST               0 ('')
              3 LOAD_ATTR                0 (split)
              6 LOAD_CONST               1 ('|')
              9 CALL_FUNCTION            1 (1 positional, 0 keyword pair)
             12 RETURN_VALUE

  1           0 LOAD_CONST               0 ('')
              3 LOAD_ATTR                0 (split)
              6 LOAD_CONST               1 ('|')
              9 LOAD_CONST               3 (-1)
             12 CALL_FUNCTION            2 (2 positional, 0 keyword pair)
             15 RETURN_VALUE
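
For reference, disassembly like the above can be produced with something like:

import dis

# Compile each expression in 'eval' mode and disassemble the resulting code object.
dis.dis(compile("''.split('|')", '<example>', 'eval'))
dis.dis(compile("''.split('|', -1)", '<example>', 'eval'))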

For similar examples:

In [217]: %timeit int()
10000000 loops, best of 3: 94 ns per loop
In [218]: %timeit int(0)
10000000 loops, best of 3: 134 ns per loop
In [235]: %timeit [0].pop()
1000000 loops, best of 3: 229 ns per loop
In [236]: %timeit [0].pop(0)
1000000 loops, best of 3: 270 ns per loop

So the LOAD_CONST takes about 40ns in both these cases, just like passing -1 instead of no argument to split.

Python 3.4 is a little harder to test, because it caches some things that 2.7 doesn't, but it looks like it's about 33ns to pass an extra argument, or 533ns if it's a keyword argument. So, if you need to split tiny strings a billion times in Python 3, use s.split('|', 10), not s.split('|', maxsplit=10).
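
A rough sketch for checking that on your own Python 3 interpreter (exact numbers will differ):

from timeit import timeit

# Positional vs. keyword maxsplit (Python 3 only; str.split in 2.7 rejects the keyword)
setup = "s = 'a|b|c'"
print(timeit("s.split('|', 10)", setup=setup, number=1000000))
print(timeit("s.split('|', maxsplit=10)", setup=setup, number=1000000))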

The proper initial test (the original test had json_line and '|' mixed up):

python -m timeit -s "json_line='a|b|c'" "part_one=json_line.split('|')[0]"
1000000 loops, best of 3: 0.239 usec per loop
python -m timeit -s "json_line='a|b|c'" "part_one=json_line.split('|',1)[0]"
1000000 loops, best of 3: 0.267 usec per loop

The time difference is smaller.
