
Fastest way to encode characters in a list of list of strings

For an NLP task, given a mapping, I need to encode each Unicode character as an integer in a list of lists of words. I'm trying to find a fast way to do this without dropping down to Cython.

Here's a slowish way to write the function:

def encode(sentences, mappings):
    encoded_sentences = []
    for sentence in sentences:
        encoded_sentence = []
        for word in sentence:
            encoded_word = []
            for ch in word:
                encoded_word.append(mappings[ch])
            encoded_sentence.append(encoded_word)
        encoded_sentences.append(encoded_sentence)
    return encoded_sentences

Given the following input data:

my_sentences = [['i', 'need', 'to'],
                ['tend', 'to', 'tin']]

mappings = {'i': 0, 'n': 1, 'e': 2, 'd': 3, 't': 4, 'o': 5}

I want encode(my_sentences, mappings) to produce:

[[[0], [1, 2, 2, 3], [4, 5]],
 [[4, 2, 1, 3], [4, 5], [4, 0, 1]]]

List comprehensions are 23% faster (and more concise):

%%timeit
encode(my_sentences, mappings)
100000 loops, best of 3: 4.75 µs per loop

def encode_compr(sentences, mappings):
    return [[[mappings[char] for char in word] for word in sent] for sent in sentences]

%%timeit
encode_compr(my_sentences, mappings)
100000 loops, best of 3: 3.67 µs per loop
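
To reproduce this comparison outside IPython, the standard-library timeit module can be used. The following is a minimal sketch, assuming Python 3; the two functions are copied from above, the loop count is an arbitrary choice, and absolute timings will differ from machine to machine.

```python
import timeit

my_sentences = [['i', 'need', 'to'],
                ['tend', 'to', 'tin']]
mappings = {'i': 0, 'n': 1, 'e': 2, 'd': 3, 't': 4, 'o': 5}


def encode(sentences, mappings):
    # Explicit nested loops, as in the question.
    encoded_sentences = []
    for sentence in sentences:
        encoded_sentence = []
        for word in sentence:
            encoded_word = []
            for ch in word:
                encoded_word.append(mappings[ch])
            encoded_sentence.append(encoded_word)
        encoded_sentences.append(encoded_sentence)
    return encoded_sentences


def encode_compr(sentences, mappings):
    # Same logic written as nested list comprehensions.
    return [[[mappings[ch] for ch in word] for word in sent] for sent in sentences]


# Both versions must agree before their speed is compared.
assert encode(my_sentences, mappings) == encode_compr(my_sentences, mappings)

number = 10000  # calls per repeat (arbitrary; raise it for steadier numbers)
for func in (encode, encode_compr):
    # Take the best of 3 repeats and report microseconds per call.
    best = min(timeit.repeat(lambda: func(my_sentences, mappings),
                             number=number, repeat=3))
    print('%s: %.2f us per loop' % (func.__name__, best / number * 1e6))
```

The lambda adds a small constant call overhead to both measurements, so the relative difference will look slightly smaller than with %%timeit.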

Slower alternatives (for the record)

string.translate

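# Python 2 only: string.maketrans and string.translate are not available in Python 3
import string
_from = 'inedto'
_to = '012345'
trans = string.maketrans(_from, _to)

def encode_translate(sentences, mappings):
    # mappings is unused here; the module-level trans table does the lookup
    return [[[int(string.translate(char, trans)) for char in word] for word in sent] for sent in sentences]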

%%timeit
encode_translate(my_sentences, mappings)
100000 loops, best of 3: 17.4 µs per loop

You can gain something from str.translate (Py2) or bytes.translate (Py3) if you use it for bulk conversions: instead of working with heavyweight individual int values, the transformation happens at the C level. The end result isn't a list of lists, though; each word becomes a bytearray (or bytes on Py3). For a collection of small int values (from 0 to 255), bytearray is largely equivalent (it's a mutable sequence, just like list), so it's often all you need. First, outside the function, you make a translation table:

# Py2
import string
transtable = string.maketrans('inedto', str(bytearray(range(6))))

# Py3
transtable = bytes.maketrans(b'inedto', bytes(range(6)))

Then use it in the function:

# Py2
def encode(sentences, mappings):
    # mappings is the translation table built above, not the original dict
    return [[bytearray(w).translate(mappings) for w in sentence]
            for sentence in sentences]

# Py3
def encode(sentences, mappings):
    # Assumes sentences are bytes, not str; can be tweaked to work with str
    # but it will be slower/more complicated
    return [[w.translate(mappings) for w in sentence]
            for sentence in sentences]

On Py2, this runs in about two-thirds the time of the original:

>>> %timeit -r5 encode(my_sentences, mappings)  # Original function
100000 loops, best of 5: 4.51 µs per loop
>>> %timeit -r5 encode(my_sentences, transtable)  # My alternate function
100000 loops, best of 5: 2.97 µs per loop

Py3 improvements are similar, but only if the input sentences are bytes objects rather than str, which avoids extra conversions. It all depends on what input and output formats you can live with.
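
If the input on Py3 is str rather than bytes, one possible tweak is to encode each word before translating it. This is only a sketch under stated assumptions: encode_str is a hypothetical helper name, every character is assumed to be ASCII and present in the table, and the extra encode step eats into the speed advantage described above.

```python
# Translation table built once, outside the function (Python 3).
transtable = bytes.maketrans(b'inedto', bytes(range(6)))


def encode_str(sentences, table):
    # Hypothetical helper: turn each str word into bytes, then translate in C.
    # Assumes every character is ASCII and has an entry in the table;
    # bytes missing from the table pass through as their own byte value.
    return [[w.encode('ascii').translate(table) for w in sentence]
            for sentence in sentences]


my_sentences = [['i', 'need', 'to'],
                ['tend', 'to', 'tin']]

encoded = encode_str(my_sentences, transtable)

# Each word is now a bytes object; list() exposes the integer codes.
as_lists = [[list(w) for w in sentence] for sentence in encoded]
print(as_lists)
# [[[0], [1, 2, 2, 3], [4, 5]], [[4, 2, 1, 3], [4, 5], [4, 0, 1]]]
```

Converting back to nested lists is shown only to confirm that the values match the expected output; doing it in real code would give up most of the benefit of staying in bytes.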


 