简体   繁体   中英

Concatenating strings faster than appending to lists

I am trying to isolate specific items in a list (eg [0, 1, 1] will return [0, 1] ). I managed to get through this, but I noticed something strange.

When I tried to append to list it ran about 7 times slower then when I was concatenating strings and then splitting it.

This is my code:

import time
start = time.time()

first = [x for x in range(99999) if x % 2 == 0]
second = [x for x in range(99999) if x % 4 == 0]

values = first + second

distinct_string = ""

for i in values:
    if not str(i) in distinct_string:
        distinct_string += str(i) + " "

print(distinct_string.split())

print(" --- %s sec --- " % (start - time.time()))

This result end in about 5 seconds... Now for the lists:

import time
start = time.time()

first = [x for x in range(99999) if x % 2 == 0]
second = [x for x in range(99999) if x % 4 == 0]

values = first + second

distinct_list = []

for i in values:
    if not i in distinct_list:
        distinct_list.append(i)

print(distinct_list)

print(" --- %s sec --- " % (start - time.time()))

Runs at around 40 seconds.

What makes string faster even though I am converting a lot of values to strings?

Note that it's generally better to use timeit to compare functions, which runs the same thing multiple times to get average performance, and to factor out repeated code to focus on the performance that matters. Here's my test script:

first = [x for x in range(999) if x % 2 == 0]
second = [x for x in range(999) if x % 4 == 0]

values = first + second

def str_method(values):
    distinct_string = ""
    for i in values:
        if not str(i) in distinct_string:
            distinct_string += str(i) + " "
    return [int(s) for s in distinct_string.split()]

def list_method(values):
    distinct_list = []
    for i in values:
        if not i in distinct_list:
            distinct_list.append(i)
    return distinct_list

def set_method(values):
    seen = set()
    return [val for val in values if val not in seen and seen.add(val) is None]

if __name__ == '__main__':
    assert str_method(values) == list_method(values) == set_method(values)
    import timeit
    funcs = [func.__name__ for func in (str_method, list_method, set_method)]
    setup = 'from __main__ import {}, values'.format(', '.join(funcs))
    for func in funcs:
        print(func)
        print(timeit.timeit(
            '{}(values)'.format(func),
            setup=setup,
            number=1000
        ))

I've added int conversion to make sure that the functions return the same thing, and get the following results:

str_method
1.1685157899992191
list_method
2.6124089090008056
set_method
0.09523714500392089

Note that it is not true that searching in a list is faster than searching in a string if you have to convert the input:

>>> timeit.timeit('1 in l', setup='l = [9, 8, 7, 6, 5, 4, 3, 2, 1]')
0.15300405000016326
>>> timeit.timeit('str(1) in s', setup='s = "9 8 7 6 5 4 3 2 1"')
0.23205067300295923

Repeated append ing to a list is not very efficient, as it means frequent resizing of the underlying object - the list comprehension, as shown in the set version, is more efficient.

searching in strings:

if not str(i) in distinct_string:

is much faster

then searching in lists

if not i in distinct_list:

here are lprofile lines for string search in OP code

Line #      Hits         Time  Per Hit   % Time      Line Contents 


    17     75000     80366013   1071.5     92.7       if not str(i) in distinct_string:
    18     50000      2473212     49.5      2.9                  distinct_string += str(i) + " "

and for list search in OP code

   39     75000    769795432  10263.9     99.1          if not i in distinct_list:
   40     50000      2813804     56.3      0.4              distinct_list.append(i)

I think there is a flaw of logic that makes the string method seemingly much faster.
When matching substrings in a long string, the in operator will return prematurely at the first substring containing the search item. To prove this, I let the loop run backwards from the highest values down to the smallest, and it returned only 50% of the values of the original loop (I checked the length of the result only). If the matching was exact there should be no difference whether you check the sequence from the start or from the end. I conclude that the string method short-cuts a lot of comparisons by matching near the start of the long string. The particular choice of duplicates is unfortunately masking this.

In a second test, I let the string method search for " " + str(i) + " " to eliminate substring matches. Now it will run only about 2x faster than the list method (but still, faster).

@jonrsharpe: Regarding the set_method I cannot see why one would touch all set elements one by one and not in one set statement like this:

def set_method(values):
    return list(set(values))

This produces exactly the same output and runs about 2.5x faster on my PC.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM