Efficiently check if a string contains a digit in Python
I have a huge amount (GB) of text to process, sentence by sentence. In each sentence I have a costly operation to perform on numbers, so I first check that the sentence contains at least one digit. I have done this check in several different ways and timed each one with timeit.
import re
s = 'abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyz' # example
any(c.isdigit() for c in s)
# 3.61 µs
re.search(r'\d', s)
# 402 ns
d = re.compile(r'\d')
d.search(s)
# 126 ns
'0' in s or '1' in s or '2' in s or '3' in s or '4' in s or '5' in s or '6' in s or '7' in s or '8' in s or '9' in s
# 60 ns
The last way is the fastest one, but it is ugly and probably 10x slower than what is possible.
Of course I could rewrite this in Cython, but that seems like overkill.
Is there a better pure Python solution? In particular, I wonder why str.startswith() and str.endswith() accept a tuple argument, while the same does not seem to be possible with the in operator.
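To make the asymmetry described above concrete, here is a minimal sketch. The helper name has_digit_chain is hypothetical and only restates the chained in test as a loop; it makes no claim about speed.

```python
s = 'abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyz'

# str.startswith() and str.endswith() accept a tuple of candidates.
print(s.startswith(('a', 'b')))   # s begins with 'a'
print(s.endswith(('0', '1')))     # s ends with 'z', so neither candidate matches

# The in operator on a str requires a single str on its left-hand side;
# a tuple raises TypeError instead of being treated as "any of these".
try:
    ('0', '1') in s
except TypeError as exc:
    print('TypeError:', exc)

# Hypothetical helper: the long `or` chain written as a loop over the ten digits.
def has_digit_chain(text):
    """Return True if text contains at least one ASCII digit 0-9."""
    return any(d in text for d in '0123456789')

print(has_digit_chain(s))
print(has_digit_chain(s + '7'))
```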
Actual performance might vary depending on your platform and Python version, but on my setup (Python 3.9.5 / Ubuntu), re.match turns out to be significantly faster than re.search, and it outperforms the long chain of in tests. Also, compiling the regex with [0-9] instead of \d gives a small improvement.
import re
from timeit import timeit
n = 10_000_000
s = 'abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyz'
# reference
timeit(lambda: '0' in s or '1' in s or '2' in s or '3' in s or '4' in s or '5' in s or '6' in s or '7' in s or '8' in s or '9' in s, number=n)
# 2.1005349759998353
# re.search with \d, slower
d = re.compile(r'\d')
timeit(lambda: d.search(s), number=n)
# 2.9816031390000717
# re.search with [0-9], better but still slower than reference
d = re.compile('[0-9]')
timeit(lambda: d.search(s), number=n)
# 2.640713582999524
# re.match with [0-9], faster than reference (note: match only tests the start of s)
d = re.compile('[0-9]')
timeit(lambda: d.match(s), number=n)
# 1.5671786130005785
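So, on my machine, using re.match with a compiled [0-9] pattern is about 25% faster than the long or ... in chain. And it looks better too. Be aware, though, that re.match only matches at the start of the string, so d.match(s) tests just the first character. Unlike re.search or the in chain, it will not find a digit further into the sentence, which is also why it returns so quickly here.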
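The following sketch illustrates the two semantic differences that affect the comparison above: match() versus search(), and [0-9] versus \d. The sample strings and the helper name contains_digit are made up for illustration, and no timings are claimed.

```python
import re

ascii_digit = re.compile(r'[0-9]')  # ASCII digits only
any_digit = re.compile(r'\d')       # on str patterns, also matches other Unicode decimal digits

leading = '7 apples'   # digit at position 0
inner = 'room 101'     # digits only later in the string

# match() anchors at the start of the string; search() scans the whole string.
print(bool(ascii_digit.match(leading)))
print(bool(ascii_digit.match(inner)))
print(bool(ascii_digit.search(inner)))

# \d and [0-9] disagree on non-ASCII decimal digits.
arabic = 'price: \u0663'  # U+0663 ARABIC-INDIC DIGIT THREE
print(bool(any_digit.search(arabic)))
print(bool(ascii_digit.search(arabic)))

# Hypothetical helper: a check that finds an ASCII digit anywhere in the text.
def contains_digit(text):
    """Return True if text contains an ASCII digit at any position."""
    return ascii_digit.search(text) is not None
```

If the goal is to find a digit anywhere in a sentence, search() is the call that has the same meaning as the chained in test. match() answers the narrower question of whether the sentence begins with a digit.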