Fastest way to encode characters in a list of lists of strings

Question

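For an NLP task, I need to encode every Unicode character in a list of lists of words as an integer, given a mapping. I'm trying to find a fast way to do this without going straight to Cython.

Here is a slow way to write the function: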

def encode(sentences, mappings):
    encoded_sentences = []
    for sentence in sentences:
        encoded_sentence = []
        for word in sentence:
            encoded_word = []
            for ch in word:
                encoded_word.append(mappings[ch])
            encoded_sentence.append(encoded_word)
        encoded_sentences.append(encoded_sentence)
    return encoded_sentences

Given the following input data:

my_sentences = [['i', 'need', 'to'],
            ['tend', 'to', 'tin']]

mappings = {'i': 0, 'n': 1, 'e': 2, 'd': 3, 't': 4, 'o': 5}

I want encode(my_sentences, mappings) to produce:

[[[0], [1, 2, 2, 3], [4, 5]],
 [[4, 2, 1, 3], [4, 5], [4, 0, 1]]]

Answer 1

A list comprehension is about 23% faster (and more concise):

%%timeit
encode(my_sentences, mappings)
100000 loops, best of 3: 4.75 µs per loop

def encode_compr(sentences, mappings):
    return [[[mappings[char] for char in word] for word in sent] for sent in sentences]

%%timeit
encode_compr(my_sentences, mappings)
100000 loops, best of 3: 3.67 µs per loop
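
To reproduce this comparison outside IPython, here is a minimal sketch using the standard timeit module. It is not part of the original answer: it first checks that both functions return the same result, then times each one. The loop count is an arbitrary choice, and absolute numbers will differ by machine and Python version.

```python
import timeit

def encode(sentences, mappings):
    # Original nested-loop version from the question.
    encoded_sentences = []
    for sentence in sentences:
        encoded_sentence = []
        for word in sentence:
            encoded_word = []
            for ch in word:
                encoded_word.append(mappings[ch])
            encoded_sentence.append(encoded_word)
        encoded_sentences.append(encoded_sentence)
    return encoded_sentences

def encode_compr(sentences, mappings):
    # Nested list comprehension version.
    return [[[mappings[ch] for ch in word] for word in sent] for sent in sentences]

my_sentences = [['i', 'need', 'to'], ['tend', 'to', 'tin']]
mappings = {'i': 0, 'n': 1, 'e': 2, 'd': 3, 't': 4, 'o': 5}

# Both versions must agree before their timings are worth comparing.
assert encode(my_sentences, mappings) == encode_compr(my_sentences, mappings)

LOOPS = 10000  # arbitrary; raise it for more stable numbers
for fn in (encode, encode_compr):
    best = min(timeit.repeat(lambda: fn(my_sentences, mappings), number=LOOPS, repeat=3))
    print("%s: %.2f us per loop" % (fn.__name__, best / LOOPS * 1e6))
```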

Slower alternatives (for the record)

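string.translate (Python 2)

# Python 2 only: string.maketrans and string.translate are not available
# in current Python 3.
import string
_from = 'inedto'
_to = '012345'
trans = string.maketrans(_from, _to)

def encode_translate(sentences, mappings):
    # mappings is unused here; the module-level translation table does the work.
    return [[[int(string.translate(char, trans)) for char in word] for word in sent] for sent in sentences]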

%%timeit
encode_translate(my_sentences, mappings)
100000 loops, best of 3: 17.4 µs per loop
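
For readers on Python 3, a rough port of the same idea might look like the sketch below. This is an adaptation, not the code that was timed above: it uses str.maketrans and str.translate, translates a whole word at a time instead of one character at a time, and only works when every code is a single decimal digit (0 to 9), as in the example data. The name encode_translate_py3 is made up for illustration.

```python
# Python 3 sketch; assumes every target code is a single digit 0-9.
_from = 'inedto'
_to = '012345'
trans = str.maketrans(_from, _to)

def encode_translate_py3(sentences):
    # Translate each word to a string of digits, then convert each digit to int.
    return [[[int(ch) for ch in word.translate(trans)] for word in sent]
            for sent in sentences]

my_sentences = [['i', 'need', 'to'], ['tend', 'to', 'tin']]
result = encode_translate_py3(my_sentences)
```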

Answer 2

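If you are doing bulk conversion, you can gain something from str.translate (Py2) or bytes.translate (Py3), which avoids working with heavyweight individual int values and does the conversion at the C level instead. The end result is not a list of lists; it is a list of bytearray objects (or of bytes on Py3). For a small set of int values (0 through 255), a bytearray is largely equivalent (it is a mutable sequence, just like a list), so it is often all you need. First, outside the function, build a translation table: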

# Py2
import string
transtable = string.maketrans('inedto', str(bytearray(range(6))))

# Py3
transtable = bytes.maketrans(b'inedto', bytes(range(6)))

Then use it inside the function:

# Py2
def encode(sentences, mappings):
    return [[bytearray(w).translate(mappings) for w in sentence]
            for sentence in sentences]

# Py3
def encode(sentences, mappings):
    # Assumes sentences are bytes, not str; can be tweaked to work with str
    # but it will be slower/more complicated
    return [[w.translate(mappings) for w in sentence]
            for sentence in sentences]

On Py2, the timings come out at about 2/3 of the original run time:

>>> %timeit -r5 encode(my_sentences, mappings)  # Original function
100000 loops, best of 5: 4.51 µs per loop
>>> %timeit -r5 encode(my_sentences, transtable)  # My alternate function
100000 loops, best of 5: 2.97 µs per loop

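The improvement on Py3 is similar, but only if the input sentences are bytes objects rather than str (which avoids an extra conversion). It all depends on what you can accept for input and output formats.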
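
As an illustration of the str-input case on Py3, the sketch below builds the translation table from the mappings dict instead of hard-coding it, and encodes each word to bytes before translating. It is not from the original answer and makes two assumptions: every key is a single ASCII character, and every value fits in 0 to 255. The name encode_bytes is made up for illustration. Note one behavioural difference: a character that is missing from the table passes through as its own byte value instead of raising KeyError.

```python
mappings = {'i': 0, 'n': 1, 'e': 2, 'd': 3, 't': 4, 'o': 5}

# Assumption: keys are single ASCII characters, values are ints in 0..255.
keys = ''.join(mappings).encode('ascii')
values = bytes(mappings[k] for k in mappings)
transtable = bytes.maketrans(keys, values)

def encode_bytes(sentences, table):
    # The extra str -> bytes conversion is the cost mentioned above.
    return [[w.encode('ascii').translate(table) for w in sentence]
            for sentence in sentences]

my_sentences = [['i', 'need', 'to'], ['tend', 'to', 'tin']]
encoded = encode_bytes(my_sentences, transtable)

# bytes objects yield ints when iterated, so they convert to lists directly
# if the original list-of-lists-of-lists shape is required.
as_lists = [[list(w) for w in sentence] for sentence in encoded]
```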
