Fast checking if a string can be converted to float or int in Python

I need to convert all strings in a large array to int or float types, if they can be converted. People usually suggest a try-except or regex approach (as in "Checking if a string can be converted to float in Python"), but that turns out to be very slow.

The question is: how can I write that code so it runs as fast as possible?

I found that strings have an .isdigit() method. Is there something like that for floats?
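For reference, a small sketch of what .isdigit() accepts and rejects. It only tests whether every character is a digit, so it rejects signs, decimal points, and exponents, and it accepts some Unicode digits that int() cannot parse. The helper looks_like_plain_int is a hypothetical name introduced here for illustration, not something from the standard library.

```python
# str.isdigit() checks characters only; it knows nothing about signs or decimals.
samples = ["123", "-123", "1.5", "1e3", "²"]
for s in samples:
    print(repr(s), s.isdigit())

def looks_like_plain_int(text):
    """Hypothetical helper: True for an optional sign followed by ASCII digits."""
    digits = text[1:] if text[:1] in ("+", "-") else text
    # isascii() filters out characters such as '²', which pass isdigit()
    # but make int() raise ValueError.
    return digits.isascii() and digits.isdigit()
```

A check like this covers integers only; it does not answer the question for floats.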

Here is the current (slow) code.

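    import numpy as np

    def convert_lines(lines):
        result = []
        for line in lines:
            resline = []
            for item in line:
                try:
                    resline.append(int(item))
                except ValueError:
                    try:
                        resline.append(float(item))
                    except ValueError:
                        # Neither int nor float: keep the original string
                        resline.append(item)
            result.append(resline)
        return np.array(result)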

There is also some evidence (https://stackoverflow.com/a/2356970/3642151) that the regex approach is even slower.

Your return value shows you are using NumPy. Therefore, you should be using np.loadtxt or np.genfromtxt (with the dtype=None parameter) to load the lines into a NumPy array. The dtype=None parameter will automatically detect whether each string can be converted to a float or an int.

np.loadtxt is faster and requires less memory than np.genfromtxt, but requires you to specify the dtype: there is no dtype=None option for automatic dtype detection. See Joe Kington's post for a comparison.

If you find that loading the CSV using np.loadtxt or np.genfromtxt is still too slow, then Pandas' read_csv function is much, much faster, but (of course) it requires you to install Pandas first, and the result is a Pandas DataFrame, not a NumPy array. DataFrames have many nice features (and can be converted into NumPy arrays), so you may find this an advantage not only for loading speed but also for data manipulation.


By the way, if you don't specify the dtype in the call

np.array(data)

then np.array uses a single dtype for all the data. If your data contains both ints and floats, then np.array will return an array with a float dtype:

In [91]: np.array([[1, 2.0]]).dtype
Out[91]: dtype('float64')

Even worse, if your data contains numbers and strings, np.array(data) will return an array of string dtype:

In [92]: np.array([[1, 2.0, 'Hi']]).dtype
Out[92]: dtype('S32')

So all the hard work you go through checking which strings are ints or floats gets destroyed in the very last line. np.genfromtxt(..., dtype=None) gets around this problem by returning a structured array (one with a heterogeneous dtype).
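A minimal sketch of the difference described above, assuming a recent NumPy on Python 3. The sample CSV text is made up for illustration, and the encoding argument is passed explicitly so the string column comes back as text rather than bytes.

```python
import io

import numpy as np

# Made-up sample data: one int column, one float column, one string column.
csv_text = "1,2.5,abc\n3,4.5,def\n"

# dtype=None asks genfromtxt to infer a dtype for each column separately.
table = np.genfromtxt(io.StringIO(csv_text), delimiter=",",
                      dtype=None, encoding="utf-8")
print(table.dtype)          # one field per column, default names f0, f1, f2
print(table["f0"], table["f1"], table["f2"])

# np.array without a dtype coerces everything to a single common dtype,
# which is a string dtype as soon as one element is a string.
mixed = np.array([[1, 2.0, "Hi"]])
print(mixed.dtype)
```

Each field of the structured array keeps its own type, so the int and float columns can be used numerically without another conversion pass.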

Try profiling your Python script: you'll find that try... except, float, and int are not the most time-consuming calls in your script.

import random
import string
import cProfile

def profile_str2float(calls):
    for x in range(calls):
        str2float(random_str(100))

def str2float(text):
    try:
        return float(text)
    except ValueError:
        return None

def random_str(length):
    return ''.join(random.choice(string.ascii_lowercase) for x in range(length))

cProfile.run('profile_str2float(10**5)', sort='cumtime')

Running this script, I get the following results:

         40400003 function calls in 14.721 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000   14.721   14.721 <string>:1(<module>)
        1    0.126    0.126   14.721   14.721 str2float.py:5(profile_str2float)
   100000    0.111    0.000   14.352    0.000 str2float.py:15(random_str)
   100000    1.413    0.000   14.241    0.000 {method 'join' of 'str' objects}
 10100000    4.393    0.000   12.829    0.000 str2float.py:16(<genexpr>)
 10000000    7.115    0.000    8.435    0.000 random.py:271(choice)
 10000000    0.760    0.000    0.760    0.000 {method 'random' of '_random.Random' objects}
 10000000    0.559    0.000    0.559    0.000 {len}
   100000    0.242    0.000    0.242    0.000 str2float.py:9(str2float)
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}

As the cumulative time column shows, the str2float function does not consume much CPU time: over 100,000 calls it takes barely 250 ms. Almost all of the run time goes to generating the random strings.
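To look at the conversion cost on its own, without the string generation that dominates the profile, one option is to time str2float directly with timeit. This is only a sketch: the two sample inputs are arbitrary choices, and the absolute numbers will vary by machine and Python version.

```python
import timeit

def str2float(text):
    try:
        return float(text)
    except ValueError:
        return None

# Arbitrary sample inputs: one that converts and one that raises ValueError.
samples = {"numeric": "3.14159", "non-numeric": "abcdefghij" * 10}

timings = {}
for label, value in samples.items():
    # The lambda runs inside this iteration, so it sees the current value.
    timings[label] = timeit.timeit(lambda: str2float(value), number=100000)
    print(label, round(timings[label], 4), "seconds for 100000 calls")
```

Comparing the two timings gives a rough idea of how much a raised exception costs relative to a successful conversion.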

All generalizations are false (irony intended). One cannot say that try: except: is always faster than regex, or vice versa. In your case, regex is not overkill and would be much faster than the try: except: method. However, based on our discussion in the comments on your question, I went ahead and implemented a C library that performs this conversion efficiently (since I see this question a lot on SO); the library is called fastnumbers. Below are timing tests using your try: except: method, using regex, and using fastnumbers.


from __future__ import print_function
import timeit

prep_code = '''\
import random
import string
x = [''.join(random.sample(string.ascii_letters, 7)) for _ in range(10)]
y = [str(random.randint(0, 1000)) for _ in range(10)]
z = [str(random.random()) for _ in range(10)]
'''

try_method = '''\
def converter_try(vals):
    resline = []
    for item in vals:
        try:
            resline.append(int(item))
        except ValueError:
            try:
                resline.append(float(item))
            except ValueError:
                resline.append(item)

'''

re_method = '''\
import re
int_match = re.compile(r'[+-]?\d+$').match
float_match = re.compile(r'[-+]?\d*\.?\d+(?:[eE][-+]?\d+)?$').match
def converter_re(vals):
    resline = []
    for item in vals:
        if int_match(item):
            resline.append(int(item))
        elif float_match(item):
            resline.append(float(item))
        else:
            resline.append(item)

'''

fn_method = '''\
from fastnumbers import fast_real
def converter_fn(vals):
    resline = []
    for item in vals:
        resline.append(fast_real(item))

'''

print('Try with non-number strings', timeit.timeit('converter_try(x)', prep_code+try_method), 'seconds')
print('Try with integer strings', timeit.timeit('converter_try(y)', prep_code+try_method), 'seconds')
print('Try with float strings', timeit.timeit('converter_try(z)', prep_code+try_method), 'seconds')
print()
print('Regex with non-number strings', timeit.timeit('converter_re(x)', prep_code+re_method), 'seconds')
print('Regex with integer strings', timeit.timeit('converter_re(y)', prep_code+re_method), 'seconds')
print('Regex with float strings', timeit.timeit('converter_re(z)', prep_code+re_method), 'seconds')
print()
print('fastnumbers with non-number strings', timeit.timeit('converter_fn(x)', prep_code+fn_method), 'seconds')
print('fastnumbers with integer strings', timeit.timeit('converter_fn(y)', prep_code+fn_method), 'seconds')
print('fastnumbers with float strings', timeit.timeit('converter_fn(z)', prep_code+fn_method), 'seconds')
print()

The output looks like this on my machine:

Try with non-number strings 55.1374599934 seconds
Try with integer strings 11.8999788761 seconds
Try with float strings 41.8258318901 seconds

Regex with non-number strings 11.5976541042 seconds
Regex with integer strings 18.1302199364 seconds
Regex with float strings 19.1559209824 seconds

fastnumbers with non-number strings 4.02173805237 seconds
fastnumbers with integer strings 4.21903610229 seconds
fastnumbers with float strings 4.96900391579 seconds

A few things are pretty clear:

  • try: except: is very slow for non-numeric input; regex beats that handily
  • try: except: becomes more efficient if exceptions don't need to be raised
  • fastnumbers beats the pants off both in all cases

So, if you don't want to use fastnumbers, you need to assess whether you are more likely to encounter invalid strings or valid strings, and base your choice of algorithm on that.
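Speed is not the only difference between the two pure-Python approaches: they also do not accept exactly the same inputs. The sketch below reuses the two regular expressions from the timing code in per-item converters (the names convert_try and convert_re are introduced here for illustration) and prints a few inputs where the results match and a few where they do not.

```python
import re

# Same patterns as in the timing test above.
int_match = re.compile(r'[+-]?\d+$').match
float_match = re.compile(r'[-+]?\d*\.?\d+(?:[eE][-+]?\d+)?$').match

def convert_try(item):
    """Let int() and float() decide; fall back to the original string."""
    try:
        return int(item)
    except ValueError:
        try:
            return float(item)
        except ValueError:
            return item

def convert_re(item):
    """Convert only what the regular expressions recognise."""
    if int_match(item):
        return int(item)
    if float_match(item):
        return float(item)
    return item

agree = ["42", "-7", "3.5", "1e3", "abc"]
# float() and int() accept these, but the patterns above do not.
differ = ["1.", " 42 ", "inf"]

for s in agree + differ:
    print(repr(s), repr(convert_try(s)), repr(convert_re(s)))
```

If inputs such as a trailing decimal point, surrounding whitespace, or inf/nan can occur in the data, the patterns would need to be extended before the regex version can stand in for the try: except: version.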
