
Fast checking if a string can be converted to float or int in Python

I need to convert all strings in a large array to int or float types, if they can be converted. Usually, people suggest a try-except or regex approach (like in Checking if a string can be converted to float in Python), but it turns out to be very slow.

The question is: how to write that code the fastest way possible?

I found that there is an .isdigit() method on strings. Is there something like that for floats?
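For reference, str.isdigit() only recognizes strings of unsigned decimal digits, so it rejects both floats and signed integers:

```python
# str.isdigit() covers only unsigned integer strings.
print('123'.isdigit())    # True
print('12.3'.isdigit())   # False: the decimal point is not a digit
print('-123'.isdigit())   # False: the sign is not a digit
```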

Here is the current (slow) code.

    result = []
    for line in lines:
        resline = []
        for item in line:
            try:
                resline.append(int(item))
            except ValueError:
                try:
                    resline.append(float(item))
                except ValueError:
                    resline.append(item)
        result.append(resline)
    return np.array(result)

There is also some evidence (https://stackoverflow.com/a/2356970/3642151) that the regex approach is even slower.

Your return value shows you are using NumPy. Therefore, you should be using np.loadtxt or np.genfromtxt (with the dtype=None parameter) to load the lines into a NumPy array. With dtype=None, np.genfromtxt automatically detects whether each column can be converted to a float or int.

np.loadtxt is faster and requires less memory than np.genfromtxt, but it requires you to specify the dtype -- there is no dtype=None automatic-dtype-detection option. See Joe Kington's post for a comparison.

If you find loading the CSV with np.loadtxt or np.genfromtxt is still too slow, then Pandas' read_csv function is much, much faster, but (of course) it requires you to install Pandas first, and the result is a Pandas DataFrame, not a NumPy array. DataFrames have many nice features (and can be converted into NumPy arrays), so you may find this an advantage not only in loading speed but also for data manipulation.
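As a minimal sketch of the dtype=None detection (the comma-separated values here are made up for illustration), genfromtxt infers a type per column:

```python
import io

import numpy as np

# Two rows of mixed-type, comma-separated data (illustrative values).
data = io.StringIO("1,2.5,foo\n3,4.5,bar")

# dtype=None asks genfromtxt to infer a dtype per column;
# encoding is set explicitly so strings come back as str, not bytes.
arr = np.genfromtxt(data, delimiter=',', dtype=None, encoding='utf-8')

print(arr.dtype)  # a structured dtype: one (int, float, str) field per column
```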


By the way, if you don't specify the dtype in the call

np.array(data)

then np.array uses a single dtype for all the data. If your data contains both ints and floats, then np.array will return an array with a float dtype:

In [91]: np.array([[1, 2.0]]).dtype
Out[91]: dtype('float64')

Even worse, if your data contains numbers and strings, np.array(data) will return an array of string dtype:

In [92]: np.array([[1, 2.0, 'Hi']]).dtype
Out[92]: dtype('S32')

So all the hard work you go through checking which strings are ints or floats gets destroyed in the very last line. np.genfromtxt(..., dtype=None) gets around this problem by returning a structured array (one with a heterogeneous dtype).

Try profiling your Python script; you'll find that try... except, float, and int are not the most time-consuming calls in your script.

import random
import string
import cProfile

def profile_str2float(calls):
    for _ in range(calls):
        str2float(random_str(100))

def str2float(s):  # renamed from 'string' to avoid shadowing the string module
    try:
        return float(s)
    except ValueError:
        return None

def random_str(length):
    return ''.join(random.choice(string.ascii_lowercase) for _ in range(length))

cProfile.run('profile_str2float(10**5)', sort='cumtime')

Running this script I get the following results:

         40400003 function calls in 14.721 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000   14.721   14.721 <string>:1(<module>)
        1    0.126    0.126   14.721   14.721 str2float.py:5(profile_str2float)
   100000    0.111    0.000   14.352    0.000 str2float.py:15(random_str)
   100000    1.413    0.000   14.241    0.000 {method 'join' of 'str' objects}
 10100000    4.393    0.000   12.829    0.000 str2float.py:16(<genexpr>)
 10000000    7.115    0.000    8.435    0.000 random.py:271(choice)
 10000000    0.760    0.000    0.760    0.000 {method 'random' of '_random.Random' objects}
 10000000    0.559    0.000    0.559    0.000 {len}
   100000    0.242    0.000    0.242    0.000 str2float.py:9(str2float)
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}

As you can see from the cumulative-time column, the str2float function is not consuming much CPU time: in 100,000 calls it barely uses 250 ms.
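To isolate the conversion cost itself, a quick check with timeit (the string and call count here are chosen arbitrarily) avoids the random-string generation that dominates the profile above:

```python
import timeit

# Build the non-numeric test string once in setup, so only the failed
# float() conversion (raise + catch) is being measured.
setup = "s = 'x' * 100"
stmt = """
try:
    float(s)
except ValueError:
    pass
"""
t = timeit.timeit(stmt, setup, number=100_000)
print(f'{t:.3f} seconds for 100,000 failed conversions')
```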

All generalizations are false (irony intended). One cannot say that try: except: is always faster than a regex or vice versa. In your case, a regex is not overkill and would be much faster than the try: except: method. However, based on our discussion in the comments on your question, I went ahead and implemented a C library that performs this conversion efficiently (since I see this question a lot on SO); the library is called fastnumbers. Below are timing tests using your try: except: method, using a regex, and using fastnumbers.


from __future__ import print_function
import timeit

prep_code = '''\
import random
import string
x = [''.join(random.sample(string.ascii_letters, 7)) for _ in range(10)]
y = [str(random.randint(0, 1000)) for _ in range(10)]
z = [str(random.random()) for _ in range(10)]
'''

try_method = '''\
def converter_try(vals):
    resline = []
    for item in vals:
        try:
            resline.append(int(item))
        except ValueError:
            try:
                resline.append(float(item))
            except ValueError:
                resline.append(item)
    return resline

'''

re_method = '''\
import re
int_match = re.compile(r'[+-]?\d+$').match
float_match = re.compile(r'[-+]?\d*\.?\d+(?:[eE][-+]?\d+)?$').match
def converter_re(vals):
    resline = []
    for item in vals:
        if int_match(item):
            resline.append(int(item))
        elif float_match(item):
            resline.append(float(item))
        else:
            resline.append(item)
    return resline

'''

fn_method = '''\
from fastnumbers import fast_real
def converter_fn(vals):
    resline = []
    for item in vals:
        resline.append(fast_real(item))
    return resline

'''

print('Try with non-number strings', timeit.timeit('converter_try(x)', prep_code+try_method), 'seconds')
print('Try with integer strings', timeit.timeit('converter_try(y)', prep_code+try_method), 'seconds')
print('Try with float strings', timeit.timeit('converter_try(z)', prep_code+try_method), 'seconds')
print()
print('Regex with non-number strings', timeit.timeit('converter_re(x)', prep_code+re_method), 'seconds')
print('Regex with integer strings', timeit.timeit('converter_re(y)', prep_code+re_method), 'seconds')
print('Regex with float strings', timeit.timeit('converter_re(z)', prep_code+re_method), 'seconds')
print()
print('fastnumbers with non-number strings', timeit.timeit('converter_fn(x)', prep_code+fn_method), 'seconds')
print('fastnumbers with integer strings', timeit.timeit('converter_fn(y)', prep_code+fn_method), 'seconds')
print('fastnumbers with float strings', timeit.timeit('converter_fn(z)', prep_code+fn_method), 'seconds')
print()

The output looks like this on my machine:

Try with non-number strings 55.1374599934 seconds
Try with integer strings 11.8999788761 seconds
Try with float strings 41.8258318901 seconds

Regex with non-number strings 11.5976541042 seconds
Regex with integer strings 18.1302199364 seconds
Regex with float strings 19.1559209824 seconds

fastnumbers with non-number strings 4.02173805237 seconds
fastnumbers with integer strings 4.21903610229 seconds
fastnumbers with float strings 4.96900391579 seconds

A few things are pretty clear:

  • try: except: is very slow for non-numeric input; regex beats that handily
  • try: except: becomes more efficient if exceptions don't need to be raised
  • fastnumbers beats the pants off both in all cases

So, if you don't want to use fastnumbers, you need to assess whether you are more likely to encounter invalid strings or valid strings, and base your algorithm choice on that.
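As a rough illustration of that trade-off (a pure-Python sketch, not the fastnumbers approach), a converter can use a cheap string precheck to avoid the exception machinery on obviously non-numeric input:

```python
def convert(item):
    """Convert a string to int or float if possible, else return it unchanged.

    The precheck is a crude filter: it rejects most non-numeric strings
    without paying for a raised exception, and falls back to float()
    only for strings that at least look numeric.
    """
    stripped = item.lstrip('+-')
    if stripped.isdigit():  # pure unsigned integer: no exception needed
        return int(item)
    # Crude float shape check: digits once '.', 'e', and signs are removed.
    core = (stripped.replace('.', '', 1).replace('e', '', 1)
                    .replace('-', '', 1).replace('+', '', 1))
    if stripped and core.isdigit():
        try:
            return float(item)
        except ValueError:
            return item
    return item

print([convert(s) for s in ['42', '-3.14', '1e6', 'hello']])
# -> [42, -3.14, 1000000.0, 'hello']
```

Whether this beats a plain try: except: depends, as above, on the ratio of invalid to valid strings in your data.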
