Fastest way to encode characters in a list of list of strings
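For an NLP task, given a mapping, I need to encode each unicode character as an integer in a list of lists of words. I'm trying to figure out a fast way to do this without dropping straight into Cython.
Here is a slow way of writing the function: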
def encode(sentences, mappings):
    encoded_sentences = []
    for sentence in sentences:
        encoded_sentence = []
        for word in sentence:
            encoded_word = []
            for ch in word:
                encoded_word.append(mappings[ch])
            encoded_sentence.append(encoded_word)
        encoded_sentences.append(encoded_sentence)
    return encoded_sentences
Given the following input data:
my_sentences = [['i', 'need', 'to'],
                ['tend', 'to', 'tin']]
mappings = {'i': 0, 'n': 1, 'e': 2, 'd': 3, 't': 4, 'o': 5}
I want encode(my_sentences, mappings) to produce:
[[[0], [1, 2, 2, 3], [4, 5]],
[[4, 2, 1, 3], [4, 5], [4, 0, 1]]]
A list comprehension is about 23% faster (and more concise):
%%timeit
encode(my_sentences, mappings)
100000 loops, best of 3: 4.75 µs per loop
def encode_compr(sentences, mappings):
    return [[[mappings[char] for char in word] for word in sent] for sent in sentences]
%%timeit
encode_compr(my_sentences, mappings)
100000 loops, best of 3: 3.67 µs per loop
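The %%timeit cells above only run inside IPython. As a rough sketch, the same comparison can be run as a plain script with the standard timeit module. The helper name best_time_us and the loop counts are my own choices for illustration, and absolute numbers will differ by machine and Python version.

```python
import timeit

def encode(sentences, mappings):
    # Explicit nested loops, as in the slow version above
    encoded_sentences = []
    for sentence in sentences:
        encoded_sentence = []
        for word in sentence:
            encoded_word = []
            for ch in word:
                encoded_word.append(mappings[ch])
            encoded_sentence.append(encoded_word)
        encoded_sentences.append(encoded_sentence)
    return encoded_sentences

def encode_compr(sentences, mappings):
    # Nested list comprehension version
    return [[[mappings[ch] for ch in word] for word in sent] for sent in sentences]

my_sentences = [['i', 'need', 'to'],
                ['tend', 'to', 'tin']]
mappings = {'i': 0, 'n': 1, 'e': 2, 'd': 3, 't': 4, 'o': 5}

def best_time_us(func, number=20000, repeat=3):
    # Hypothetical helper: best per-call time in microseconds over `repeat` runs
    timer = timeit.Timer(lambda: func(my_sentences, mappings))
    return min(timer.repeat(repeat=repeat, number=number)) / number * 1e6

if __name__ == "__main__":
    for func in (encode, encode_compr):
        print("%s: %.2f us per call" % (func.__name__, best_time_us(func)))
```

The lambda wrapper adds a small constant overhead to both measurements, so the ratio between the two is more meaningful than either absolute figure.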
I also tried string.translate (Python 2), which turned out to be slower:
import string
_from = 'inedto'
_to = '012345'
trans = string.maketrans(_from, _to)
def encode_translate(sentences, mappings):
    return [[[int(string.translate(char, trans)) for char in word] for word in sent] for sent in sentences]
%%timeit
encode_translate(my_sentences, mappings)
100000 loops, best of 3: 17.4 µs per loop
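Part of the cost in the attempt above is likely that translate is called once per character. A variant that translates each whole word in one call might look like the sketch below. It is written for Python 3, where string.maketrans no longer exists and str.maketrans is used instead; the function name encode_translate_words is my own, and the digit trick only works while every code is a single digit (0 to 9). I have not timed it, so treat it as an illustration rather than a proven speedup.

```python
my_sentences = [['i', 'need', 'to'],
                ['tend', 'to', 'tin']]

# Map each character to the digit character of its code (assumes codes 0..9 only)
trans = str.maketrans('inedto', '012345')

def encode_translate_words(sentences, table):
    # Translate each word in a single call, then convert the digits to ints
    return [[[int(digit) for digit in word.translate(table)] for word in sent]
            for sent in sentences]

print(encode_translate_words(my_sentences, trans))
```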
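You can get something out of str.translate (Py2) or bytes.translate (Py3) if you use it for bulk conversion: it avoids creating heavyweight individual int values and does the conversion at the C level instead. The end result is not a list of lists; it is a list of bytearray objects (or bytes on Py3). For a set of small int values (from 0 to 255), a bytearray is largely equivalent (it is a mutable sequence, just like a list), so it is often all you need. First, outside the function, you create a translation table: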
# Py2
import string
transtable = string.maketrans('inedto', str(bytearray(range(6))))
# Py3
transtable = bytes.maketrans(b'inedto', bytes(range(6)))
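The tables above hard-code 'inedto'. If the mapping already lives in a dict like the one in the question, the table could be derived from it. This is a Python 3 sketch with a helper name of my own (make_table); it assumes every key is a single ASCII character and every value fits in 0..255.

```python
mappings = {'i': 0, 'n': 1, 'e': 2, 'd': 3, 't': 4, 'o': 5}

def make_table(mappings):
    # Hypothetical helper: build a 256-byte translation table from a dict.
    # Keys and values of a dict iterate in the same order, so they line up.
    chars = ''.join(mappings).encode('ascii')  # assumes single ASCII-character keys
    codes = bytes(mappings.values())           # assumes integer values in 0..255
    return bytes.maketrans(chars, codes)

transtable = make_table(mappings)
print(list(b'need'.translate(transtable)))
```

One caveat: bytes that are not in the table are passed through unchanged rather than raising an error, so an unmapped b'x' would silently come out as 120, whereas the dict lookup would raise KeyError.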
Then use it in the function:
# Py2
# Note: here `mappings` is the translation table built above, not the dict
def encode(sentences, mappings):
    return [[bytearray(w).translate(mappings) for w in sentence]
            for sentence in sentences]
# Py3
def encode(sentences, mappings):
    # Assumes sentences are bytes, not str; can be tweaked to work with str
    # but it will be slower/more complicated
    return [[w.translate(mappings) for w in sentence]
            for sentence in sentences]
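For str input on Python 3, one possible tweak is to encode each word to bytes before translating. The sketch below uses a function name of my own (encode_str) and assumes every character is ASCII; a non-ASCII character would raise UnicodeEncodeError, and the extra encode call costs time, which matches the "slower/more complicated" remark in the comment above.

```python
transtable = bytes.maketrans(b'inedto', bytes(range(6)))

def encode_str(sentences, table):
    # Encode each str word to bytes first (assumes ASCII-only text),
    # then translate the whole word in one C-level call
    return [[w.encode('ascii').translate(table) for w in sentence]
            for sentence in sentences]

my_sentences = [['i', 'need', 'to'],
                ['tend', 'to', 'tin']]
encoded = encode_str(my_sentences, transtable)
print(encoded)
```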
Timings on Py2 show the result running in roughly 2/3 of the original time:
>>> %timeit -r5 encode(my_sentences, mapping) # Original function
100000 loops, best of 5: 4.51 µs per loop
>>> %timeit -r5 encode(my_sentences, transtable) # My alternate function
100000 loops, best of 5: 2.97 µs per loop
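The improvement on Py3 is similar, but only if the input sentences are bytes objects rather than str (which avoids an extra conversion). It all depends on what you can work with for input and output formats.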
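On the output side, a Py3 bytes result already behaves like a sequence of small ints, so downstream code may not need real lists at all. If it does, the conversion is short. A sketch, with to_lists being a name I made up for illustration:

```python
transtable = bytes.maketrans(b'inedto', bytes(range(6)))

encoded_word = b'tend'.translate(transtable)

# Indexing and iterating a bytes object yields ints on Python 3
first_code = encoded_word[0]
codes = list(encoded_word)

def to_lists(encoded_sentences):
    # Convert a list of lists of bytes back to a list of lists of int lists
    return [[list(word) for word in sentence] for sentence in encoded_sentences]

nested = to_lists([[b'i'.translate(transtable), b'need'.translate(transtable)]])
print(first_code, codes, nested)
```

Converting back to lists gives up part of the speed gained from translate, so it probably only makes sense where a consumer strictly requires list objects.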