I need to convert all strings in a large array to int or float, if they can be converted. The usual suggestion is a try-except or regex approach (as in Checking if a string can be converted to float in Python ), but both turn out to be very slow.
The question is: what is the fastest way to write this code?
I found that strings have an .isdigit() method. Is there something similar for floats?
Here is the current (slow) code:
result = []
for line in lines:
    resline = []
    for item in line:
        try:
            resline.append(int(item))
        except ValueError:
            try:
                resline.append(float(item))
            except ValueError:
                resline.append(item)
    result.append(resline)
return np.array(result)
There is also some evidence ( https://stackoverflow.com/a/2356970/3642151 ) that the regex approach is even slower.
Your return value shows you are using NumPy, so you should be using np.loadtxt or np.genfromtxt (with the dtype=None parameter) to load the lines into a NumPy array. The dtype=None parameter automatically detects whether a string can be converted to a float or an int.
np.loadtxt is faster and requires less memory than np.genfromtxt, but requires you to specify the dtype -- there is no dtype=None automatic-dtype-detection option. See Joe Kington's post for a comparison.
If you find loading the CSV with np.loadtxt or np.genfromtxt is still too slow, then Pandas' read_csv function is much, much faster, but (of course) it requires you to install Pandas first, and the result is a Pandas DataFrame, not a NumPy array. DataFrames have many nice features (and can be converted into NumPy arrays), so you may find this an advantage not only for loading speed but also for data manipulation.
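A minimal sketch of that Pandas route (the CSV content and column names here are invented for illustration, assuming a recent Pandas version): read_csv infers a separate dtype per column, so ints, floats, and strings each keep their natural type:

```python
import io

import pandas as pd

# hypothetical CSV with an int, a float, and a string column
csv_text = "a,b,c\n1,2.5,hi\n3,4.0,bye"

df = pd.read_csv(io.StringIO(csv_text))

# each column gets its own dtype instead of one dtype for everything:
# a -> int64, b -> float64, c -> object (strings)
print(df.dtypes)
```

If you need a NumPy array afterwards, df.to_numpy() converts the DataFrame (with a common dtype across columns, as described above).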
By the way, if you don't specify the dtype in the call np.array(data), then np.array uses a single dtype for all the data. If your data contains both ints and floats, then np.array will return an array with a float dtype:
In [91]: np.array([[1, 2.0]]).dtype
Out[91]: dtype('float64')
Even worse, if your data contains numbers and strings, np.array(data) will return an array of string dtype:
In [92]: np.array([[1, 2.0, 'Hi']]).dtype
Out[92]: dtype('S32')
So all the hard work you go through checking which strings are ints or floats gets destroyed in the very last line. np.genfromtxt(..., dtype=None) gets around this problem by returning a structured array (one with a heterogeneous dtype).
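A short sketch of that behavior (the column data below is invented): feeding mixed columns to np.genfromtxt with dtype=None yields a structured array with one field per column, each field keeping its own dtype:

```python
import io

import numpy as np

# two rows: an int column, a float column, and a string column
csv_text = "1,2.5,Hi\n3,4.0,Bye"

arr = np.genfromtxt(io.StringIO(csv_text), delimiter=',',
                    dtype=None, encoding='utf-8')

# the fields stay int / float / str instead of collapsing to one string dtype,
# e.g. [('f0', '<i8'), ('f1', '<f8'), ('f2', '<U3')]
print(arr.dtype)
print(arr['f1'])  # the float column
```

The exact integer width ('<i4' vs '<i8') depends on the platform, but the point stands: no per-column type information is lost.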
Try profiling your Python script; you'll find that try... except, float, and int are not the most time-consuming calls in your script.
import random
import string
import cProfile

def profile_str2float(calls):
    for x in range(calls):
        str2float(random_str(100))

def str2float(s):
    try:
        return float(s)
    except ValueError:
        return None

def random_str(length):
    return ''.join(random.choice(string.ascii_lowercase) for x in range(length))

cProfile.run('profile_str2float(10**5)', sort='cumtime')
Running this script I get the following results:
40400003 function calls in 14.721 seconds
Ordered by: cumulative time
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 14.721 14.721 <string>:1(<module>)
1 0.126 0.126 14.721 14.721 str2float.py:5(profile_str2float)
100000 0.111 0.000 14.352 0.000 str2float.py:15(random_str)
100000 1.413 0.000 14.241 0.000 {method 'join' of 'str' objects}
10100000 4.393 0.000 12.829 0.000 str2float.py:16(<genexpr>)
10000000 7.115 0.000 8.435 0.000 random.py:271(choice)
10000000 0.760 0.000 0.760 0.000 {method 'random' of '_random.Random' objects}
10000000 0.559 0.000 0.559 0.000 {len}
100000 0.242 0.000 0.242 0.000 str2float.py:9(str2float)
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
As you can see from the cumulative-time column, the str2float function is not consuming much CPU time: in 100,000 calls it barely uses 250 ms.
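One caveat worth making explicit: in the profile above, nearly all the time goes to generating the random strings inside the profiled loop. A sketch that builds the data up front (using a smaller run of 10**4 strings here to keep it quick; the answer above used 10**5) isolates the conversion cost itself:

```python
import cProfile
import pstats
import random
import string

def random_str(length):
    return ''.join(random.choice(string.ascii_lowercase) for _ in range(length))

def str2float(s):
    try:
        return float(s)
    except ValueError:
        return None

# build the test data up front so string generation is not profiled
data = [random_str(100) for _ in range(10**4)]

profiler = cProfile.Profile()
profiler.enable()
for s in data:
    str2float(s)
profiler.disable()

# now the stats reflect only str2float and the float() calls inside it
pstats.Stats(profiler).sort_stats('cumulative').print_stats(5)
```

Profiled this way, the try/except-plus-float() path is clearly a small fraction of the total runtime of the original script.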
All generalizations are false (irony intended). One cannot say that try: except: is always faster than regex or vice versa. In your case, regex is not overkill and would be much faster than the try: except: method. However, based on our discussions in the comments section of your question, I went ahead and implemented a C library that efficiently performs this conversion (since I see this question a lot on SO); the library is called fastnumbers . Below are timing tests using your try: except: method, using regex, and using fastnumbers.
from __future__ import print_function
import timeit
prep_code = '''\
import random
import string
x = [''.join(random.sample(string.ascii_letters, 7)) for _ in range(10)]
y = [str(random.randint(0, 1000)) for _ in range(10)]
z = [str(random.random()) for _ in range(10)]
'''
try_method = '''\
def converter_try(vals):
    resline = []
    for item in vals:
        try:
            resline.append(int(item))
        except ValueError:
            try:
                resline.append(float(item))
            except ValueError:
                resline.append(item)
'''
re_method = '''\
import re
int_match = re.compile(r'[+-]?\d+$').match
float_match = re.compile(r'[-+]?\d*\.?\d+(?:[eE][-+]?\d+)?$').match
def converter_re(vals):
    resline = []
    for item in vals:
        if int_match(item):
            resline.append(int(item))
        elif float_match(item):
            resline.append(float(item))
        else:
            resline.append(item)
'''
fn_method = '''\
from fastnumbers import fast_real
def converter_fn(vals):
    resline = []
    for item in vals:
        resline.append(fast_real(item))
'''
print('Try with non-number strings', timeit.timeit('converter_try(x)', prep_code+try_method), 'seconds')
print('Try with integer strings', timeit.timeit('converter_try(y)', prep_code+try_method), 'seconds')
print('Try with float strings', timeit.timeit('converter_try(z)', prep_code+try_method), 'seconds')
print()
print('Regex with non-number strings', timeit.timeit('converter_re(x)', prep_code+re_method), 'seconds')
print('Regex with integer strings', timeit.timeit('converter_re(y)', prep_code+re_method), 'seconds')
print('Regex with float strings', timeit.timeit('converter_re(z)', prep_code+re_method), 'seconds')
print()
print('fastnumbers with non-number strings', timeit.timeit('converter_fn(x)', prep_code+fn_method), 'seconds')
print('fastnumbers with integer strings', timeit.timeit('converter_fn(y)', prep_code+fn_method), 'seconds')
print('fastnumbers with float strings', timeit.timeit('converter_fn(z)', prep_code+fn_method), 'seconds')
print()
The output looks like this on my machine:
Try with non-number strings 55.1374599934 seconds
Try with integer strings 11.8999788761 seconds
Try with float strings 41.8258318901 seconds
Regex with non-number strings 11.5976541042 seconds
Regex with integer strings 18.1302199364 seconds
Regex with float strings 19.1559209824 seconds
fastnumbers with non-number strings 4.02173805237 seconds
fastnumbers with integer strings 4.21903610229 seconds
fastnumbers with float strings 4.96900391579 seconds
A few things are pretty clear:
- try: except: is very slow for non-numeric input; regex beats it handily.
- try: except: becomes more efficient if exceptions don't need to be raised.
- fastnumbers beats the pants off both in all cases.
So, if you don't want to use fastnumbers, you need to assess whether you are more likely to encounter invalid strings or valid strings, and base your algorithm choice on that.
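That trade-off can be sketched as a single dispatcher (the helper name and flag below are made up for illustration; the regexes are the same ones used in the timing test): take the exception-based path when the data is mostly numeric, and a regex pre-check when junk strings dominate:

```python
import re

# same patterns as in the timing test above
_int_match = re.compile(r'[+-]?\d+$').match
_float_match = re.compile(r'[-+]?\d*\.?\d+(?:[eE][-+]?\d+)?$').match

def convert(item, expect_numeric=True):
    """Hypothetical helper: pick a strategy based on the expected data."""
    if expect_numeric:
        # exceptions are cheap when they rarely fire
        try:
            return int(item)
        except ValueError:
            try:
                return float(item)
            except ValueError:
                return item
    # regex pre-checks avoid raising on every junk string
    if _int_match(item):
        return int(item)
    if _float_match(item):
        return float(item)
    return item
```

Either branch returns the same results; only the cost profile differs, which is exactly the point of the timings above.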