Most efficient way to turn a list of strings of integers to an array of integers

I've got a simple problem - I need to convert a string of integers to a list of integers and insert it into a numpy array.

I have code that works, but I'm interested in a more efficient method if there is one. The starting condition is that I have a list of strings of integers (line 4 of the code below), and the goal is to get a numpy array filled with those integers.

Here is an example of the code I use:

import numpy as np
print("Hello StackOverflow")

listOfStringOfINTs = ["123231231231231"]*5
print(listOfStringOfINTs)
numpyVectorOfInts = np.empty([len(listOfStringOfINTs),len(listOfStringOfINTs[0]) ], dtype='int')
for i, IntString in enumerate(listOfStringOfINTs):
    numpyVectorOfInts[i] = list(map(int, IntString))

print(numpyVectorOfInts)

I'm not sure this is faster, but it's simpler:

In [68]: np.array([list(astr) for astr in listOfStringOfINTs],int)           
Out[68]: 
array([[1, 2, 3, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1],
       [1, 2, 3, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1],
       [1, 2, 3, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1],
       [1, 2, 3, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1],
       [1, 2, 3, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1]])

list(astr) splits the string into a list of 1-character strings. np.array with the int dtype then takes care of converting all of those strings to integers.
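To see the two steps separately, here is a small sketch on a shorter string (the names `chars` and `digits` are made up for illustration):

```python
import numpy as np

# list() on a string yields one single-character string per position.
chars = list("1232")
print(chars)   # ['1', '2', '3', '2']

# Asking for an integer dtype makes NumPy parse each string as an integer.
digits = np.array(chars, dtype=int)
print(digits)  # [1 2 3 2]
```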

Or you could join all the strings into one string, make a list of its characters, and then reshape the array:

np.array(list(''.join(listOfStringOfINTs)),int).reshape(5,-1)

Leveraging the fact that all strings have the same number of characters, we can use a vectorized approach based on view -

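def get_int_ar(a):
    # Store the strings as bytes ('S') so each character is exactly one byte;
    # on Python 3 the default unicode dtype uses four bytes per character.
    return (np.array(a, 'S').view('u1') - 48).reshape(len(a), -1)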

Sample run -

In [143]: listOfStringOfINTs = ["123231231231231"]*5

In [144]: get_int_ar(listOfStringOfINTs)
Out[144]: 
array([[1, 2, 3, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1],
       [1, 2, 3, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1],
       [1, 2, 3, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1],
       [1, 2, 3, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1],
       [1, 2, 3, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1]], dtype=uint8)
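A rough sketch of what the view does, assuming Python 3, where the strings have to be stored as bytes (dtype 'S') for each character to occupy exactly one byte. The two-element `strings` list is a made-up sample, not data from the question:

```python
import numpy as np

strings = ["123", "456"]                 # hypothetical sample input

# Default unicode storage: 4 bytes per character, so 12 bytes per element.
as_unicode = np.array(strings)
print(as_unicode.dtype.itemsize)         # 12

# Byte-string storage: 1 byte per character, so 3 bytes per element.
as_bytes = np.array(strings, dtype="S")
print(as_bytes.dtype.itemsize)           # 3

# Reinterpret the same memory as uint8: one ASCII code per character.
raw = as_bytes.view("u1")
print(raw)                               # [49 50 51 52 53 54]

# Subtract the code of '0' and restore one row per input string.
digits = (raw - 48).reshape(len(strings), -1)
print(digits)
```

Viewing the unicode array as 'u1' instead would expose all four bytes of every character, which is why the byte-string dtype matters here.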

Just for fun, here is another way to do it:

>>> np.vstack([np.frombuffer(a.encode(), dtype=np.uint8) - 48 for a in listOfStringOfINTs])
array([[1, 2, 3, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1],
       [1, 2, 3, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1],
       [1, 2, 3, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1],
       [1, 2, 3, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1],
       [1, 2, 3, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1]], dtype=uint8)

This method reads the ASCII characters in as unsigned chars and relies on the fact that the numerals 0-9 are in consecutive order in the ASCII representation. Since the numeral 0 is represented as 48, we just subtract 48 from all of the values to get their value as an integer.
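The ASCII arithmetic can be checked directly. This is a minimal sketch assuming Python 3, where the text has to be given as bytes before np.frombuffer can read it; the names `codes` and `digits` are illustrative only:

```python
import numpy as np

# The ASCII codes of '0'..'9' are the consecutive values 48..57.
codes = np.frombuffer(b"0123456789", dtype=np.uint8)
print(codes)    # [48 49 50 51 52 53 54 55 56 57]

# Subtracting ord('0') == 48 therefore yields the digit values.
digits = codes - ord("0")
print(digits)   # [0 1 2 3 4 5 6 7 8 9]
```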

For small strings it's not really faster than @hpaulj's method, which is more readable:

In [1]: listOfStringOfINTs = ["123231231231231"]*10000

In [2]: %timeit np.vstack(np.frombuffer(a,dtype=np.uint8)-48 for a in listOfStringOfINTs)
10 loops, best of 3: 42.1 ms per loop

In [3]: %timeit np.array([list(astr) for astr in listOfStringOfINTs],int)
10 loops, best of 3: 36.3 ms per loop

But for large strings it can make a big difference:

In [4]: listOfStringOfINTs = ["123231231231231"*1000]*10000

In [5]: %timeit np.vstack(np.frombuffer(a,dtype=np.uint8)-48 for a in listOfStringOfINTs)
10 loops, best of 3: 115 ms per loop

In [6]: %timeit np.array([list(astr) for astr in listOfStringOfINTs],int)
1 loop, best of 3: 30.4 s per loop

All of the above answers are correct, but the one I find easiest to understand is:

    >>> k = [list(x) for x in listOfStringOfINTs ]
    >>> print(np.array(k, dtype=np.int64))
    [[1 2 3 2 3 1 2 3 1 2 3 1 2 3 1]
     [1 2 3 2 3 1 2 3 1 2 3 1 2 3 1]
     [1 2 3 2 3 1 2 3 1 2 3 1 2 3 1]
     [1 2 3 2 3 1 2 3 1 2 3 1 2 3 1]
     [1 2 3 2 3 1 2 3 1 2 3 1 2 3 1]]

Here is a solution using "".join:

def digit_ize(a):
    r = np.fromstring(''.join(a), 'u1')
    r &= 0x0f
    return r.reshape(len(a), -1)

or (slightly faster, and it avoids np.fromstring, whose binary mode has been deprecated since NumPy 1.14):

def digit_ize(a):
    r = np.frombuffer(''.join(a).encode(), 'u1') & 0x0f
    return r.reshape(len(a), -1)
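Why the `& 0x0f` mask works: the ASCII codes of '0'..'9' are 0x30..0x39, so the low four bits of each code are the digit itself. A small sketch comparing the mask with subtracting 48 (the variable names are made up for illustration):

```python
import numpy as np

raw = np.frombuffer("0123456789".encode(), dtype="u1")

masked = raw & 0x0F    # keep only the low four bits of each ASCII code
shifted = raw - 48     # subtract the code of '0'

# For the characters '0'..'9' both give the same digit values.
assert (masked == shifted).all()
print(masked)          # [0 1 2 3 4 5 6 7 8 9]
```

Neither variant validates the input: a non-digit character such as 'A' (0x41) is silently mapped to 1 by the mask, so this assumes the strings contain digits only.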

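Timings (microseconds per call; "small" is 5 strings, "large" is 500 strings; pp1 = digit_ize, pp2 = digit_ize_2, div = get_int_ar, usr = use_vstack, hpj = use_list):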

small
pp1 4.314555088058114
pp2 2.933372976258397
div 3.740947926416993
usr 29.473979957401752
hpj 12.974489014595747
large
pp1 9.718517074361444
pp2 7.069707033224404
div 37.66830707900226
usr 2321.8201039126143
hpj 1103.1720889732242

Script to produce the timings; it contains Python 3 adjustments of the other solutions where necessary.

import numpy as np

def digit_ize():
    r = np.fromstring(''.join(a), 'u1')
    r &= 0x0f
    return r.reshape(len(a), -1)

def digit_ize_2():
    r = np.frombuffer(''.join(a).encode(), 'u1') & 0x0f
    return r.reshape(len(a), -1)

def get_int_ar():
    return (np.array(a, 'S').view('u1')-48).reshape(len(a),-1)

def use_vstack():
    # Pass a list: newer NumPy versions reject a generator here.
    return np.vstack([np.frombuffer(b.encode(), dtype=np.uint8) - 48 for b in a])

def use_list():
    return np.array([list(astr) for astr in a], int)

from timeit import timeit

listOfStringOfINTs = ["123231231231231"]*5
a = listOfStringOfINTs
print("small")
print("pp1", timeit(digit_ize, number=1000)*1000)
print("pp2", timeit(digit_ize_2, number=1000)*1000)
print("div", timeit(get_int_ar, number=1000)*1000)
print("usr", timeit(use_vstack, number=1000)*1000)
print("hpj", timeit(use_list, number=1000)*1000)
a = a*100
print("large")
print("pp1", timeit(digit_ize, number=1000)*1000)
print("pp2", timeit(digit_ize_2, number=1000)*1000)
print("div", timeit(get_int_ar, number=1000)*1000)
print("usr", timeit(use_vstack, number=1000)*1000)
print("hpj", timeit(use_list, number=1000)*1000)
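Before comparing timings, it is worth confirming that the candidates return the same digits. A minimal sketch with standalone versions of three of the approaches; the `digits_from_*` names are made up for this check, and the input is assumed to consist of equal-length strings of ASCII digits:

```python
import numpy as np

def digits_from_join(strings):
    # Join everything, read the bytes, and mask off the low nibble.
    r = np.frombuffer("".join(strings).encode(), "u1") & 0x0F
    return r.reshape(len(strings), -1)

def digits_from_view(strings):
    # Store as byte strings, view as uint8, subtract the code of '0'.
    return (np.array(strings, "S").view("u1") - 48).reshape(len(strings), -1)

def digits_from_list(strings):
    # Split each string into characters and let NumPy parse them.
    return np.array([list(s) for s in strings], int)

sample = ["123231231231231"] * 5
expected = digits_from_list(sample)
assert np.array_equal(digits_from_join(sample), expected)
assert np.array_equal(digits_from_view(sample), expected)
print(expected.shape)  # (5, 15)
```

The byte-based versions return uint8 while the list-based one returns the platform integer type, so np.array_equal compares values only, not dtypes.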
