How to improve the speed of splitting a list?

Question

I want to speed up splitting a list. I already have a way to split the list, but it is not as fast as I expected.

def split_list(lines):
    return [x for xs in lines for x in xs.split('-')]

import time

lst= []
for i in range(1000000):
    lst.append('320000-320000')

# time.clock() was removed in Python 3.8; time.perf_counter() replaces it
start=time.perf_counter()
lst_new=split_list(lst)
end=time.perf_counter()
print('time\n',str(end-start))

For example, Input :

lst
 ['320000-320000', '320000-320000']

Output :

lst_new
 ['320000', '320000', '320000', '320000']

I'm not satisfied with the speed of splitting, as my data contains many lists.

I don't know whether there is a more efficient way to do it.

Following the advice I received, I will describe my whole problem more specifically.

import pandas as pd

df = pd.DataFrame({ 'line':["320000-320000, 340000-320000, 320000-340000",
                            "380000-320000",
                            "380000-320000,380000-310000",
                            "370000-320000,370000-320000,320000-320000",
                            "320000-320000, 340000-320000, 320000-340000",
                            "380000-320000",
                            "380000-320000,380000-310000",
                            "370000-320000,370000-320000,320000-320000",
                            "320000-320000, 340000-320000, 320000-340000",
                            "380000-320000",
                            "380000-320000,380000-310000",
                            "370000-320000,370000-320000,320000-320000"], 'id':[1,2,3,4,5,6,7,8,9,10,11,12],})

def most_common(lst):
    return max(set(lst), key=lst.count)

def split_list(lines):
    return [x for xs in lines for x in xs.split('-')]

df['line']=df['line'].str.split(',')
col_ix=df['line'].index.values
df['line_start'] = pd.Series(0, index=df.index)
df['line_destination'] = pd.Series(0, index=df.index)

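import time
# time.clock() was removed in Python 3.8; time.perf_counter() replaces it
start=time.perf_counter()

for ix in col_ix:
    col=df['line'][ix]
    # strip the spaces left over from splitting on ',' so equal codes compare equal
    col_split=[x.strip() for x in split_list(col)]
    # even positions are starting points, odd positions are destinations
    even_col_split=col_split[0::2]
    even_col_split_most=most_common(even_col_split)
    # .loc avoids chained assignment; int() keeps the column numeric
    df.loc[ix,'line_start']=int(even_col_split_most)

    odd_col_split=col_split[1::2]
    odd_col_split_most=most_common(odd_col_split)
    df.loc[ix,'line_destination']=int(odd_col_split_most)

end=time.perf_counter()
print('time\n',str(end-start))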

del df['line']
print('df\n',df)

Input :

df
 id                                         line
0    1  320000-320000, 340000-320000, 320000-340000
1    2                                380000-320000
2    3                  380000-320000,380000-310000
3    4    370000-320000,370000-320000,320000-320000
4    5  320000-320000, 340000-320000, 320000-340000
5    6                                380000-320000
6    7                  380000-320000,380000-310000
7    8    370000-320000,370000-320000,320000-320000
8    9  320000-320000, 340000-320000, 320000-340000
9   10                                380000-320000
10  11                  380000-320000,380000-310000
11  12    370000-320000,370000-320000,320000-320000

Output :

df
 id  line_start  line_destination
0    1     320000    320000
1    2     380000    320000
2    3     380000    320000
3    4     370000    320000
4    5     320000    320000
5    6     380000    320000
6    7     380000    320000
7    8     370000    320000
8    9     320000    320000
9   10     380000    320000
10  11     380000    320000
11  12     370000    320000

Each entry in line (e.g. 320000-320000) represents the starting point and the destination of a route.

Expected: make the code run faster. (The current speed is unbearable.)

Answer 1

'-'.join(lst).split('-')

seems quite a bit faster:

>>> from timeit import timeit
>>> timeit("'-'.join(lst).split('-')", globals=globals(), number=10)
1.0838123590219766
>>> timeit("[x for xs in lst for x in xs.split('-')]", globals=globals(), number=10)
3.1370303670410067
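One edge case is worth checking before swapping this in: for an empty list, `'-'.join([])` is the empty string, and `''.split('-')` returns `['']` rather than `[]`. A minimal sketch of a guarded version follows; the name `split_list_join` is made up for illustration, and the trick only applies because the join separator and the split separator are the same character.

```python
def split_list(lines):
    # Original approach from the question, kept for comparison
    return [x for xs in lines for x in xs.split('-')]

def split_list_join(lines):
    # Hypothetical helper: join everything once, then split once.
    # The guard keeps the empty-list result identical to split_list.
    if not lines:
        return []
    return '-'.join(lines).split('-')

lst = ['320000-320000', '340000-320000']
# Both approaches should produce the same flat list
assert split_list_join(lst) == split_list(lst)
```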

Answer 2

Depending on what you want to do with your list, using a generator can be slightly faster.

If you need to keep the output stored, then the list solution is faster.

If all you need is to iterate over the words once, you can get rid of some overhead by using a generator.

def split_list_gen(lines):
    for line in lines:
        yield from line.split('-')
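As a small illustration of the "iterate once" case, the generator can be fed straight into a consumer without building the intermediate list. This is only a sketch: using `collections.Counter` as the consumer is an assumption about what the downstream code might do, not something stated in the question.

```python
from collections import Counter

def split_list_gen(lines):
    # Same generator as above: yields the pieces one at a time
    for line in lines:
        yield from line.split('-')

lst = ['320000-320000', '340000-320000', '320000-340000']

# Counter consumes the generator once; no flat list is ever stored
counts = Counter(split_list_gen(lst))
print(counts.most_common(1))  # [('320000', 4)]
```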

Benchmark

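import time

# Assumes split_list (from the question) and split_list_gen are already defined
lst = ['32000-32000'] * 10000000

# time.clock() was removed in Python 3.8; time.perf_counter() replaces it
start = time.perf_counter()
for x in split_list(lst):
    pass
end = time.perf_counter()
print('list time:', str(end - start))

start = time.perf_counter()
for y in split_list_gen(lst):
    pass
end = time.perf_counter()
print('generator time:', str(end - start))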

Output

The generator solution is consistently about 10% faster.

list time: 0.4568295369982612
generator time: 0.4020671741918084

Answer 3

Pushing more of the work below the Python level seems to provide a small speedup:

In [7]: %timeit x = split_list(lst)
407 ms ± 876 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [8]: %timeit x = list(chain.from_iterable(map(methodcaller("split", "-"), lst)))
374 ms ± 2.67 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

methodcaller creates a function that calls the named method on its argument for you:

methodcaller("split", "-")(x) == x.split("-")

chain.from_iterable creates a single iterator consisting of the elements from a group of iterables:

list(chain.from_iterable([[1,2], [3,4]])) == [1,2,3,4]

Mapping the function returned by methodcaller onto your list of strings produces an iterable of lists suitable for flattening by from_iterable. The benefit of this more functional approach is that the functions involved are all implemented in C and work directly on the data in the Python objects, instead of running Python bytecode for every element.
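For reference, the pieces above can be put together into a self-contained function. This is a sketch with the imports spelled out; the wrapper name `split_list_functional` is made up for illustration.

```python
from itertools import chain
from operator import methodcaller

def split_list_functional(lines):
    # methodcaller("split", "-") builds a callable f with f(s) == s.split("-")
    split_on_dash = methodcaller("split", "-")
    # map applies it lazily; chain.from_iterable flattens the resulting lists
    return list(chain.from_iterable(map(split_on_dash, lines)))

lst = ['320000-320000', '340000-320000']
print(split_list_functional(lst))
# ['320000', '320000', '340000', '320000']
```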

3 answers

solution1 3 ACCEPTED 2018-06-11 13:36:30
solution2 2 2018-06-11 13:12:47
solution3 1 2018-06-11 13:35:43