How can I speed up splitting a list?

Question

I just want to split a list faster. I already have a way to split the list, but it is not as fast as I expected.

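def split_list(lines):
    # Split each 'a-b' string on '-' and flatten the pieces into one list.
    return [x for xs in lines for x in xs.split('-')]

import time

lst = []
for i in range(1000000):
    lst.append('320000-320000')

# time.clock() was removed in Python 3.8; time.perf_counter() replaces it here.
start = time.perf_counter()
lst_new = split_list(lst)
end = time.perf_counter()
print('time\n', str(end - start))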

For example, input:

lst
 ['320000-320000', '320000-320000']

Output:

lst_new
 ['320000', '320000', '320000', '320000']

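I am not satisfied with the splitting speed, because my data contains many lists.

But right now I don't know whether there is a more efficient way to do this.

Following the suggestions, I will try to describe my whole problem more specifically.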

import pandas as pd

df = pd.DataFrame({ 'line':["320000-320000, 340000-320000, 320000-340000",
                            "380000-320000",
                            "380000-320000,380000-310000",
                            "370000-320000,370000-320000,320000-320000",
                            "320000-320000, 340000-320000, 320000-340000",
                            "380000-320000",
                            "380000-320000,380000-310000",
                            "370000-320000,370000-320000,320000-320000",
                            "320000-320000, 340000-320000, 320000-340000",
                            "380000-320000",
                            "380000-320000,380000-310000",
                            "370000-320000,370000-320000,320000-320000"], 'id':[1,2,3,4,5,6,7,8,9,10,11,12],})

def most_common(lst):
    return max(set(lst), key=lst.count)

def split_list(lines):
    return [x for xs in lines for x in xs.split('-')]

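# Split on commas and swallow the optional space after each one,
# so that ' 340000' and '340000' are not counted as different values.
df['line'] = df['line'].str.split(r',\s*')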
col_ix=df['line'].index.values
df['line_start'] = pd.Series(0, index=df.index)
df['line_destination'] = pd.Series(0, index=df.index)

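import time
# time.clock() was removed in Python 3.8; time.perf_counter() replaces it here.
start = time.perf_counter()

for ix in col_ix:
    col = df['line'][ix]
    col_split = split_list(col)
    # Even positions hold the start points, odd positions the destinations.
    even_col_split = col_split[0::2]
    even_col_split_most = most_common(even_col_split)
    # Use .loc instead of chained indexing so the assignment reaches df itself.
    df.loc[ix, 'line_start'] = int(even_col_split_most)

    odd_col_split = col_split[1::2]
    odd_col_split_most = most_common(odd_col_split)
    df.loc[ix, 'line_destination'] = int(odd_col_split_most)

end = time.perf_counter()
print('time\n', str(end - start))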

del df['line']
print('df\n',df)

Input:

df
 id                                         line
0    1  320000-320000, 340000-320000, 320000-340000
1    2                                380000-320000
2    3                  380000-320000,380000-310000
3    4    370000-320000,370000-320000,320000-320000
4    5  320000-320000, 340000-320000, 320000-340000
5    6                                380000-320000
6    7                  380000-320000,380000-310000
7    8    370000-320000,370000-320000,320000-320000
8    9  320000-320000, 340000-320000, 320000-340000
9   10                                380000-320000
10  11                  380000-320000,380000-310000
11  12    370000-320000,370000-320000,320000-320000

Output:

df
 id  line_start  line_destination
0    1     320000    320000
1    2     380000    320000
2    3     380000    320000
3    4     370000    320000
4    5     320000    320000
5    6     380000    320000
6    7     380000    320000
7    8     370000    320000
8    9     320000    320000
9   10     380000    320000
10  11     380000    320000
11  12     370000    320000

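You can think of each value in line (e.g. 320000-32000) as the starting point and destination of a route.

Expected: make the code run faster. (I can't stand how slow it is.)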

Answer 1

'-'.join(lst).split('-')

seems to be considerably faster:

>>> timeit("'-'.join(lst).split('-')", globals=globals(), number=10)
1.0838123590219766
>>> timeit("[x for xs in lst for x in xs.split('-')]", globals=globals(), number=10)
3.1370303670410067
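
The one-liner works because every element is split on the same character that is used to glue the elements together, so one join and one split run over a single long string. Below is a minimal sketch that checks the result against the question's split_list. The helper name split_by_join is made up for illustration, and the guard covers the one case where the two approaches differ, an empty input list:

```python
def split_list(lines):
    # The list comprehension from the question.
    return [x for xs in lines for x in xs.split('-')]

def split_by_join(lines, sep='-'):
    # ''.split('-') returns [''], while the comprehension returns []
    # for an empty list, so handle that case explicitly.
    if not lines:
        return []
    return sep.join(lines).split(sep)

lst = ['320000-320000', '380000-310000']
print(split_by_join(lst))  # ['320000', '320000', '380000', '310000']
```

Note that the trick depends on the join separator and the split separator being identical; with any other separator the pieces would no longer line up with what split_list returns.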

Answer 2

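Depending on what you want to do with the list, using a generator may be slightly faster.

If you need to keep the output, the list solution is faster.

If you only need to iterate over the words once, you can use a generator to eliminate some of the overhead.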

def split_list_gen(lines):
    for line in lines:
        yield from line.split('-')

Benchmark

import time

lst = ['32000-32000'] * 10000000

# time.clock() was removed in Python 3.8; time.perf_counter() replaces it here.
start = time.perf_counter()
for x in split_list(lst):
    pass
end = time.perf_counter()
print('list time:', str(end - start))

start = time.perf_counter()
for y in split_list_gen(lst):
    pass
end = time.perf_counter()
print('generator time:', str(end - start))

Output

The generator solution is consistently about 10% faster.

list time: 0.4568295369982612
generator time: 0.4020671741918084
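
To illustrate the "iterate once" case, the generator can be passed straight to a consumer, so the flattened list is never built in memory. This is only a sketch: using collections.Counter as the consumer is an assumption made for the example, not something this answer prescribes.

```python
from collections import Counter

def split_list_gen(lines):
    # Lazily yield every piece of every 'a-b' string.
    for line in lines:
        yield from line.split('-')

lst = ['320000-320000', '380000-320000', '380000-310000']

# Counter consumes the generator one item at a time.
counts = Counter(split_list_gen(lst))
print(counts.most_common(1))  # [('320000', 3)]
```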

Answer 3

Pushing more of the work below the Python level seems to give a small speedup:

In [7]: %timeit x = split_list(lst)
407 ms ± 876 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [8]: %timeit x = list(chain.from_iterable(map(methodcaller("split", "-"), lst
   ...: )))
374 ms ± 2.67 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

methodcaller creates a function that calls the named method for you:

methodcaller("split", "-")(x) == x.split("-")

chain.from_iterable creates a single iterator made up of the elements of a group of iterables:

list(chain.from_iterable([[1,2], [3,4]])) == [1,2,3,4]

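Mapping the function returned by methodcaller over your list of strings produces an iterable of lists that from_iterable can flatten. The benefit of this more functional approach is that the functions involved are all implemented in C and can work on the data inside the Python objects directly, rather than running Python byte code that works with Python objects.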
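
The timing session above omits its imports. A self-contained version of the same idea might look like the sketch below; the wrapper name split_list_chain is chosen here for illustration and does not appear in the session.

```python
from itertools import chain
from operator import methodcaller

def split_list_chain(lines):
    # split_on_dash(s) is equivalent to s.split('-').
    split_on_dash = methodcaller('split', '-')
    # map() yields one list per string; chain.from_iterable() flattens them.
    return list(chain.from_iterable(map(split_on_dash, lines)))

lst = ['320000-320000', '380000-310000']
print(split_list_chain(lst))  # ['320000', '320000', '380000', '310000']
```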

Answer 1: 3, accepted, 2018-06-11 13:36:30
Answer 2: 2, 2018-06-11 13:12:47
Answer 3: 1, 2018-06-11 13:35:43
