Fast checking if a string can be converted to float or int in Python
I need to convert all strings in a large array to int or float types, if they can be converted. Usually, people suggest the try-except or regex approach (as in "Checking if a string can be converted to float in Python"), but it turns out to be very slow.
The question is: how can I write that code so it runs as fast as possible?
I found that strings have an .isdigit() method. Is there something like that for floats?
Here is the current (slow) code.
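# The post showed only the function body; the wrapper name `convert` is illustrative.
def convert(lines):
    result = []
    for line in lines:
        resline = []
        for item in line:
            try:
                resline.append(int(item))
            except:
                try:
                    resline.append(float(item))
                except:
                    resline.append(item)
        result.append(resline)
    return np.array(result)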
There is also some evidence (https://stackoverflow.com/a/2356970/3642151) that the regex approach is even slower.
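On the .isdigit() idea raised above, here is a small sketch (Python 3, not part of the original post) of why str.isdigit() is only a rough filter, even for integers. It rejects signs, decimal points and exponents, and it accepts some Unicode digit characters that int() refuses, so it cannot stand in for the conversion itself. The helper name int_accepts is made up for this illustration.

```python
# str.isdigit() answers "are all characters digits?", which is not the same
# question as "will int() accept this string?".
samples = ["42", "-42", "4.2", "1e3", "\u00b2"]  # "\u00b2" is SUPERSCRIPT TWO


def int_accepts(text):
    """Return True if int(text) succeeds (hypothetical helper)."""
    try:
        int(text)
        return True
    except ValueError:
        return False


for text in samples:
    # Columns: the string, what isdigit() says, what int() actually does.
    print(repr(text), text.isdigit(), int_accepts(text))
```

The two columns disagree for "-42" (a valid integer that isdigit() rejects) and for the superscript digit (which isdigit() accepts but int() does not).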
Your return value shows you are using NumPy. Therefore, you should be using np.loadtxt or np.genfromtxt (with the dtype=None parameter) to load the lines into a NumPy array. The dtype=None parameter will automatically detect whether each string can be converted to a float or an int.
np.loadtxt is faster and requires less memory than np.genfromtxt, but it requires you to specify the dtype; there is no dtype=None automatic dtype detection option. See Joe Kington's post for a comparison.
If you find that loading the CSV using np.loadtxt or np.genfromtxt is still too slow, then pandas' read_csv function is much, much faster, but (of course) it would require you to install pandas first, and the result would be a pandas DataFrame, not a NumPy array. DataFrames have many nice features (and can be converted into NumPy arrays), so you may find this to be an advantage not only in terms of loading speed but also for data manipulation.
By the way, if you don't specify the dtype in the call np.array(data), then np.array uses a single dtype for all the data. If your data contains both ints and floats, then np.array will return an array with a float dtype:
In [91]: np.array([[1, 2.0]]).dtype
Out[91]: dtype('float64')
Even worse, if your data contains numbers and strings, np.array(data) will return an array of string dtype:
In [92]: np.array([[1, 2.0, 'Hi']]).dtype
Out[92]: dtype('S32')
So all the hard work you go through checking which strings are ints or floats gets destroyed in the very last line. np.genfromtxt(..., dtype=None) gets around this problem by returning a structured array (one with a heterogeneous dtype).
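As a minimal sketch of that behaviour, assuming comma-separated input held in memory (the sample rows are invented for illustration), np.genfromtxt with dtype=None infers one dtype per column and keeps them side by side in a structured array. The field names f0, f1, f2 are the defaults NumPy assigns when no names are given.

```python
import io

import numpy as np

# Three columns: integers, floats, and text that cannot be converted.
raw = io.StringIO("1,2.5,abc\n3,4.5,def\n")

# dtype=None asks genfromtxt to infer a dtype per column. The encoding is
# passed explicitly so that text columns come back as str rather than bytes.
arr = np.genfromtxt(raw, delimiter=",", dtype=None, encoding="utf-8")

print(arr.dtype)      # one field per column, each with its own dtype
print(arr["f0"] + 1)  # the integer column stays integer
print(arr["f1"] * 2)  # the float column stays float
print(arr["f2"])      # the text column is left as strings
```

Each column is then reached by field name instead of by a second index, which is the price of keeping different dtypes in a single array.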
Try profiling your Python script. You'll find that try... except, float and int are not the most time-consuming calls in it.
import random
import string
import cProfile

def profile_str2float(calls):
    for x in xrange(calls):
        str2float(random_str(100))

def str2float(string):
    try:
        return float(string)
    except ValueError:
        return None

def random_str(length):
    return ''.join(random.choice(string.lowercase) for x in xrange(length))

cProfile.run('profile_str2float(10**5)', sort='cumtime')
# Python 2 code: on Python 3 use range and string.ascii_lowercase instead.
Running this script I get the following results:
         40400003 function calls in 14.721 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000   14.721   14.721 <string>:1(<module>)
        1    0.126    0.126   14.721   14.721 str2float.py:5(profile_str2float)
   100000    0.111    0.000   14.352    0.000 str2float.py:15(random_str)
   100000    1.413    0.000   14.241    0.000 {method 'join' of 'str' objects}
 10100000    4.393    0.000   12.829    0.000 str2float.py:16(<genexpr>)
 10000000    7.115    0.000    8.435    0.000 random.py:271(choice)
 10000000    0.760    0.000    0.760    0.000 {method 'random' of '_random.Random' objects}
 10000000    0.559    0.000    0.559    0.000 {len}
   100000    0.242    0.000    0.242    0.000 str2float.py:9(str2float)
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
As you can see from the cumulative time column (cumtime), the str2float function does not consume much CPU time: across 100,000 calls it uses barely 250 ms (0.242 s of the 14.721 s total). Almost all of the time goes into generating the random test strings.
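To measure the conversion on its own, one option is to build the inputs once and time only the conversion attempts. The following is a sketch for Python 3 using the standard timeit module. The input sizes, the fixed seed and the helper name convert_all are arbitrary choices for illustration, not part of the profile above.

```python
import random
import string
import timeit


def str2float(text):
    """Same helper as in the profiled script: a float, or None on failure."""
    try:
        return float(text)
    except ValueError:
        return None


# Build the inputs once, outside the timed region, so that only the
# conversion attempts are measured.
rng = random.Random(0)  # fixed seed, only to make runs repeatable
inputs = ["".join(rng.choice(string.ascii_lowercase) for _ in range(100))
          for _ in range(1000)]


def convert_all():
    """Attempt the conversion on every prepared input (hypothetical helper)."""
    return [str2float(text) for text in inputs]


elapsed = timeit.timeit(convert_all, number=10)
print("10 x 1000 failed conversions: %.4f s" % elapsed)
```

Every input here is non-numeric, so this times the exception path, which is the expensive case for try... except.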
All generalizations are false (irony intended). One cannot say that try: except: is always faster than regex or vice versa. In your case, regex is not overkill and would be much faster than the try: except: method. However, based on our discussion in the comments section of your question, I went ahead and implemented a C library that performs this conversion efficiently (since I see this question a lot on SO); the library is called fastnumbers. Below are timing tests using your try: except: method, using regex, and using fastnumbers.
from __future__ import print_function
import timeit

prep_code = '''\
import random
import string
x = [''.join(random.sample(string.ascii_letters, 7)) for _ in range(10)]
y = [str(random.randint(0, 1000)) for _ in range(10)]
z = [str(random.random()) for _ in range(10)]
'''

try_method = '''\
def converter_try(vals):
    resline = []
    for item in vals:
        try:
            resline.append(int(item))
        except ValueError:
            try:
                resline.append(float(item))
            except ValueError:
                resline.append(item)
'''

re_method = '''\
import re
int_match = re.compile(r'[+-]?\d+$').match
float_match = re.compile(r'[-+]?\d*\.?\d+(?:[eE][-+]?\d+)?$').match
def converter_re(vals):
    resline = []
    for item in vals:
        if int_match(item):
            resline.append(int(item))
        elif float_match(item):
            resline.append(float(item))
        else:
            resline.append(item)
'''

fn_method = '''\
from fastnumbers import fast_real
def converter_fn(vals):
    resline = []
    for item in vals:
        resline.append(fast_real(item))
'''

print('Try with non-number strings', timeit.timeit('converter_try(x)', prep_code+try_method), 'seconds')
print('Try with integer strings', timeit.timeit('converter_try(y)', prep_code+try_method), 'seconds')
print('Try with float strings', timeit.timeit('converter_try(z)', prep_code+try_method), 'seconds')
print()
print('Regex with non-number strings', timeit.timeit('converter_re(x)', prep_code+re_method), 'seconds')
print('Regex with integer strings', timeit.timeit('converter_re(y)', prep_code+re_method), 'seconds')
print('Regex with float strings', timeit.timeit('converter_re(z)', prep_code+re_method), 'seconds')
print()
print('fastnumbers with non-number strings', timeit.timeit('converter_fn(x)', prep_code+fn_method), 'seconds')
print('fastnumbers with integer strings', timeit.timeit('converter_fn(y)', prep_code+fn_method), 'seconds')
print('fastnumbers with float strings', timeit.timeit('converter_fn(z)', prep_code+fn_method), 'seconds')
print()
The output looks like this on my machine:
Try with non-number strings 55.1374599934 seconds
Try with integer strings 11.8999788761 seconds
Try with float strings 41.8258318901 seconds
Regex with non-number strings 11.5976541042 seconds
Regex with integer strings 18.1302199364 seconds
Regex with float strings 19.1559209824 seconds
fastnumbers with non-number strings 4.02173805237 seconds
fastnumbers with integer strings 4.21903610229 seconds
fastnumbers with float strings 4.96900391579 seconds
A few things are pretty clear:
- try: except: is very slow for non-numeric input; regex beats that handily
- try: except: becomes more efficient if exceptions don't need to be raised
- fastnumbers beats the pants off both in all cases
So, if you don't want to use fastnumbers, you need to assess whether you are more likely to encounter invalid strings or valid strings, and base your algorithm choice on that.
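One way to make that assessment is to time both approaches on a representative slice of your own data. The sketch below reuses the try: except: logic and the two regular expressions from the timing script. The single-item helpers convert_try and convert_re and the sample list are made up for illustration, so substitute real values before drawing conclusions.

```python
import re
import timeit

int_match = re.compile(r'[+-]?\d+$').match
float_match = re.compile(r'[-+]?\d*\.?\d+(?:[eE][-+]?\d+)?$').match


def convert_try(item):
    """Convert with nested try/except; return the string if both fail."""
    try:
        return int(item)
    except ValueError:
        try:
            return float(item)
        except ValueError:
            return item


def convert_re(item):
    """Check with a regex first, and only convert strings that match."""
    if int_match(item):
        return int(item)
    if float_match(item):
        return float(item)
    return item


# Hypothetical sample: replace with a representative slice of real data.
sample = ["12", "-7", "3.14", "1e-3", "abc", "N/A"] * 100

for name, func in [("try", convert_try), ("regex", convert_re)]:
    seconds = timeit.timeit(lambda: [func(s) for s in sample], number=100)
    print(name, round(seconds, 4), "seconds")
```

Note that the regex is stricter than float(): strings such as "1.", "inf" or "nan" are converted by the try: except: version but left as strings by the regex version, so the two are only interchangeable if such values do not occur in your data.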