Tokenize a list
I'd like to tokenize a list, using one of the list items as the delimiter.
Is there a Pythonic way to do this, or do I have to write something myself?
Data=['Label',23,'NORM','|','RESP',1.256,None,'|','','','|','RELV','','']
SubList = TokenizeList (Data,Delim='|')
Printing SubList would give:
[ ['Label',23,'NORM'] , ['RESP',1.256,None] , ['',''] , ['RELV','',''] ]
Yes, you can use itertools.groupby:
>>> from itertools import groupby
>>> Data=['Label',23,'NORM','|','RESP',1.256,None,'|','','','|','RELV','','']
>>> [list(g) for k,g in groupby(Data,key=lambda x:x == '|') if not k]
[['Label', 23, 'NORM'], ['RESP', 1.256, None], ['', ''], ['RELV', '', '']]
You can of course wrap this in a function:
def splitList(sequence, delimiter):
    return [list(g) for k, g in groupby(sequence, key=lambda x: x == delimiter) if not k]
>>> splitList(sequence = Data, delimiter = '|')
[['Label', 23, 'NORM'], ['RESP', 1.256, None], ['', ''], ['RELV', '', '']]
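To see why the `if not k` filter works, it may help to look at what groupby produces before filtering. A small sketch (the sample list `data` and the name `runs` are made up for illustration):

```python
from itertools import groupby

data = ['a', 'b', '|', 'c', '|', '|', 'd']

# Each run of consecutive items that share the same key value becomes one group.
# Here the key is True for delimiters and False for everything else.
runs = [(is_delim, list(group))
        for is_delim, group in groupby(data, key=lambda x: x == '|')]
print(runs)
# [(False, ['a', 'b']), (True, ['|']), (False, ['c']), (True, ['|', '|']), (False, ['d'])]
```

Consecutive delimiters collapse into a single True group, so dropping the True groups leaves no empty list between them.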
Try this; it is simple and straightforward (and Pythonic too):
def tokenize_list(array, sep='|'):
    result = []
    _temp = []
    for el in array:
        if el == sep:
            result.append(_temp)
            _temp = []
        else:
            _temp.append(el)
    # After the loop, append whatever is left in _temp (the last group), if anything.
    if _temp:
        result.append(_temp)
    return result
Output:
>>> data = ['Label',23,'NORM','|','RESP',1.256,None,'|','','','|','RELV','','', '|']
>>> tokenize_list(data)
[['Label', 23, 'NORM'], ['RESP', 1.256, None], ['', ''], ['RELV', '', '']]
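Note that the loop above keeps an empty group for a leading separator or between two consecutive separators, but drops the empty group after a trailing separator. If you would rather keep every empty segment, the way str.split(sep) does for strings, one possible variant looks like this (the name `split_keep_empty` is hypothetical, just for this sketch):

```python
def split_keep_empty(items, sep='|'):
    """Hypothetical helper: split like str.split(sep), keeping every empty segment."""
    segment = []
    for item in items:
        if item == sep:
            yield segment
            segment = []
        else:
            segment.append(item)
    # Always yield the last segment, even when it is empty
    # (for example after a trailing separator).
    yield segment

print(list(split_keep_empty(['a', '|', '|', 'b', '|'])))
# [['a'], [], ['b'], []]
```

Which behavior is right depends on whether a trailing separator means "end of record" or "one more empty record" in your data.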
Try this:
def group_by_sep(items, sep='|'):
    inner_list = []
    for item in items:
        if item == sep:
            yield inner_list
            inner_list = []
        else:
            inner_list.append(item)
    if inner_list:
        yield inner_list
Data=['Label',23,'NORM','|','RESP',1.256,None,'|','','','|','RELV','','','|','|','now','|']
SubList = list(group_by_sep(Data, '|'))
print(SubList)
# [['Label', 23, 'NORM'], ['RESP', 1.256, None], ['', ''], ['RELV', '', ''], [], ['now']]
Note that itertools.groupby could be used here, but it is not equivalent to the approach above and gives less control over the exact behavior:
import itertools
def group_by_sep2(items, sep='|'):
    yield from (
        list(g)
        for k, g in itertools.groupby(items, key=lambda x: x == sep)
        if not k)
SubList2 = list(group_by_sep2(Data, '|'))
print(SubList2)
# [['Label', 23, 'NORM'], ['RESP', 1.256, None], ['', ''], ['RELV', '', ''], ['now']]
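The groupby version is missing the empty list between two consecutive separators.
It is also less efficient than the direct approach above: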
%timeit list(group_by_sep(Data))
# 1000 loops, best of 3: 1.47 µs per loop
%timeit list(group_by_sep2(Data))
# 100 loops, best of 3: 4.01 µs per loop
%timeit list(group_by_sep(Data * 1000))
# 1000 loops, best of 3: 1.33 ms per loop
%timeit list(group_by_sep2(Data * 1000))
# 100 loops, best of 3: 2.83 ms per loop
%timeit list(group_by_sep(Data * 1000000))
# 1000 loops, best of 3: 1.67 s per loop
%timeit list(group_by_sep2(Data * 1000000))
# 100 loops, best of 3: 3.22 s per loop
The benchmarks show that the direct approach is roughly 2x to 3x faster.
(Edited to write everything as generators and to include more edge cases.)
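The timings above come from IPython's %timeit. If you want to rerun the comparison in a plain script, something along these lines should work with the standard timeit module. This is only a sketch: the input size and `number=100` are arbitrary choices, the groupby version is written as a plain loop for readability, and the absolute numbers will vary with machine and Python version.

```python
import itertools
import timeit

def group_by_sep(items, sep='|'):
    inner_list = []
    for item in items:
        if item == sep:
            yield inner_list
            inner_list = []
        else:
            inner_list.append(item)
    if inner_list:
        yield inner_list

def group_by_sep2(items, sep='|'):
    for is_sep, group in itertools.groupby(items, key=lambda x: x == sep):
        if not is_sep:
            yield list(group)

# Sample input without consecutive separators, so both versions return the same result.
data = ['Label', 23, 'NORM', '|', 'RESP', 1.256, None, '|',
        '', '', '|', 'RELV', '', ''] * 1000

runs = 100  # arbitrary repeat count for this sketch
t_loop = timeit.timeit(lambda: list(group_by_sep(data)), number=runs)
t_groupby = timeit.timeit(lambda: list(group_by_sep2(data)), number=runs)
print(f"loop:    {t_loop / runs * 1e3:.3f} ms per call")
print(f"groupby: {t_groupby / runs * 1e3:.3f} ms per call")
```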