
Concatenating strings faster than appending to lists

I am trying to extract the distinct items from a list (e.g. [0, 1, 1] should return [0, 1]). I got this working, but I noticed something strange.

When I appended to a list, the code ran about 7 times slower than when I concatenated strings and then split the result.

This is my code:

import time
start = time.time()

first = [x for x in range(99999) if x % 2 == 0]
second = [x for x in range(99999) if x % 4 == 0]

values = first + second

distinct_string = ""

for i in values:
    if not str(i) in distinct_string:
        distinct_string += str(i) + " "

print(distinct_string.split())

print(" --- %s sec --- " % (time.time() - start))  # elapsed = end - start

This finishes in about 5 seconds. Now for the list version:

import time
start = time.time()

first = [x for x in range(99999) if x % 2 == 0]
second = [x for x in range(99999) if x % 4 == 0]

values = first + second

distinct_list = []

for i in values:
    if not i in distinct_list:
        distinct_list.append(i)

print(distinct_list)

print(" --- %s sec --- " % (time.time() - start))  # elapsed = end - start

This runs for around 40 seconds.

What makes the string version faster, even though I am converting a lot of values to strings?

Note that it's generally better to use timeit to compare functions, since it runs the same code many times to give a more stable measurement, and to factor out repeated code so that you measure only the performance that matters. Here's my test script:

first = [x for x in range(999) if x % 2 == 0]
second = [x for x in range(999) if x % 4 == 0]

values = first + second

def str_method(values):
    distinct_string = ""
    for i in values:
        if not str(i) in distinct_string:
            distinct_string += str(i) + " "
    return [int(s) for s in distinct_string.split()]

def list_method(values):
    distinct_list = []
    for i in values:
        if not i in distinct_list:
            distinct_list.append(i)
    return distinct_list

def set_method(values):
    seen = set()
    return [val for val in values if val not in seen and seen.add(val) is None]

if __name__ == '__main__':
    assert str_method(values) == list_method(values) == set_method(values)
    import timeit
    funcs = [func.__name__ for func in (str_method, list_method, set_method)]
    setup = 'from __main__ import {}, values'.format(', '.join(funcs))
    for func in funcs:
        print(func)
        print(timeit.timeit(
            '{}(values)'.format(func),
            setup=setup,
            number=1000
        ))

I've added the int conversion to make sure that the functions return the same thing, and I get the following results:

str_method
1.1685157899992191
list_method
2.6124089090008056
set_method
0.09523714500392089

Note that it is not true that searching in a list is slower than searching in a string if you have to convert the input:

>>> timeit.timeit('1 in l', setup='l = [9, 8, 7, 6, 5, 4, 3, 2, 1]')
0.15300405000016326
>>> timeit.timeit('str(1) in s', setup='s = "9 8 7 6 5 4 3 2 1"')
0.23205067300295923

Repeatedly appending to a list is not very efficient, as it means frequent resizing of the underlying object; the list comprehension, as shown in the set version, is more efficient.
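One way to check the append-versus-comprehension claim on your own machine is a small timeit comparison. This is only a sketch: the function names with_append and with_comprehension and the even-number filter are made up for illustration, and the size of the gap will vary with the Python version and the workload.

```python
import timeit

values = list(range(1000))

def with_append(values):
    # Build the result with an explicit loop and repeated append calls.
    result = []
    for v in values:
        if v % 2 == 0:
            result.append(v)
    return result

def with_comprehension(values):
    # Build the same result with a list comprehension.
    return [v for v in values if v % 2 == 0]

# Both timings are totals over 200 calls, in seconds.
append_time = timeit.timeit(lambda: with_append(values), number=200)
comprehension_time = timeit.timeit(lambda: with_comprehension(values), number=200)

print("append loop:  ", append_time)
print("comprehension:", comprehension_time)
```

Both functions return the same list, so any difference in the printed times comes from how the list is built rather than from what it contains.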

Searching in strings:

if not str(i) in distinct_string:

is much faster than searching in lists:

if not i in distinct_list:

Here are the lprofile lines for the string search in the OP's code:

Line #      Hits         Time  Per Hit   % Time      Line Contents 


    17     75000     80366013   1071.5     92.7       if not str(i) in distinct_string:
    18     50000      2473212     49.5      2.9                  distinct_string += str(i) + " "

and for the list search in the OP's code:

   39     75000    769795432  10263.9     99.1          if not i in distinct_list:
   40     50000      2813804     56.3      0.4              distinct_list.append(i)

I think there is a logic flaw that makes the string method seem much faster than it is. When matching against the long string, the in operator tests for substrings, so it returns True as soon as the search item appears anywhere, even inside a longer number. To check this, I ran the loop backwards, from the highest value down to the lowest, and it returned only 50% of the values that the original loop returned (I compared only the lengths of the results). If the matching were exact, it would make no difference whether you process the sequence from the start or from the end. I conclude that the string method short-cuts a lot of comparisons by matching near the start of the long string. Unfortunately, the particular choice of duplicates masks this.
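The substring effect can be reproduced with two tiny inputs. The sketch below reuses the str_method logic from the test script earlier on this page; the example values 2 and 12 are chosen only for illustration.

```python
def str_method(values):
    """Deduplicate via substring search, as in the question (flawed on purpose)."""
    distinct_string = ""
    for i in values:
        # `in` on a str is a substring test: "2" is found inside "12 ".
        if str(i) not in distinct_string:
            distinct_string += str(i) + " "
    return [int(s) for s in distinct_string.split()]

ascending = str_method([2, 12])    # "12" is not a substring of "2 ", so both are kept
descending = str_method([12, 2])   # "2" is a substring of "12 ", so 2 is dropped

print(ascending)   # [2, 12]
print(descending)  # [12]
```

With ascending input the flaw stays hidden, which matches the observation that the original data masks it.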

In a second test, I let the string method search for " " + str(i) + " " to eliminate substring matches. Now it runs only about 2x faster than the list method (but still faster).
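A minimal sketch of that delimiter-based variant might look like this. The name str_method_exact is hypothetical, and starting the string with a single space is my assumption about how to make the first token match; the answer does not show its exact code.

```python
def str_method_exact(values):
    """Deduplicate via string search, matching whole space-delimited tokens only."""
    # Start with a space so that the first number is also surrounded by spaces.
    distinct_string = " "
    for i in values:
        token = " " + str(i) + " "
        # " 2 " does not occur inside " 12 ", so partial matches are ruled out.
        if token not in distinct_string:
            distinct_string += str(i) + " "
    return [int(s) for s in distinct_string.split()]

print(str_method_exact([12, 2, 12]))  # [12, 2]
print(str_method_exact([2, 12, 2]))   # [2, 12]
```

With exact token matching, the result no longer depends on whether larger or smaller numbers come first.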

@jonrsharpe: Regarding set_method, I cannot see why one would touch the set elements one by one rather than use a single set call like this:

def set_method(values):
    return list(set(values))

This produces exactly the same output and runs about 2.5x faster on my PC.
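When comparing the two set-based variants, it may help to look at element order as well as speed. The sketch below uses made-up sample values. dict.fromkeys is not mentioned in the answers above; it is included only as one order-preserving option, relying on dicts keeping insertion order in Python 3.7 and later.

```python
values = [30, 10, 20, 10, 30]

# A set has no defined order, so the order of this list is not guaranteed.
as_set = list(set(values))

# dict keys keep insertion order (Python 3.7+), so first occurrences stay in place.
ordered = list(dict.fromkeys(values))

print(sorted(as_set))  # [10, 20, 30]
print(ordered)         # [30, 10, 20]
```

Both approaches yield the same distinct elements; they can differ only in the order in which those elements come back.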
