Efficiently check whether a string contains a digit in Python

Question

I have a huge amount (GB) of text to process, sentence by sentence. For each sentence I have a costly operation to perform on numbers, so I first check that the sentence contains at least one digit. I implemented this check in several ways and measured each one with timeit.

s = 'abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyz' # example

any(c.isdigit() for c in s)    # 3.61 µs
re.search(r'\d', s)            # 402 ns
d = re.compile(r'\d')
d.search(s)                    # 126 ns
'0' in s or '1' in s or '2' in s or '3' in s or '4' in s or '5' in s or '6' in s or '7' in s or '8' in s or '9' in s    # 60 ns

The last way is the fastest, but it is ugly and probably 10x slower than what is possible.

Of course I could rewrite this in Cython, but that seems like overkill.

Is there a better pure-Python solution? In particular, I wonder why str.startswith() and str.endswith() accept a tuple argument, while the in operator does not seem to.
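To make the last point concrete, here is a minimal sketch of the behaviour described above. The helper name contains_any is hypothetical and only spells out the chained `or ... in` test as a loop. It makes no claim about speed and would need to be timed separately.

```python
s = 'abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyz'

# str.startswith() and str.endswith() accept a tuple of candidates.
print(s.startswith(('a', 'b')))   # True
print(s.endswith(('0', '1')))     # False

# The `in` operator on a str requires a str on the left,
# so passing a tuple raises TypeError.
try:
    ('0', '1') in s
except TypeError as exc:
    print('TypeError:', exc)

# Hypothetical helper (not a built-in): same result as the chained
# "'0' in s or '1' in s or ..." expression, written as a loop.
def contains_any(text, needles):
    return any(needle in text for needle in needles)

print(contains_any(s, '0123456789'))        # False
print(contains_any(s + '7', '0123456789'))  # True
```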

Answer 1

Actual performance may vary with your platform and Python version, but on my setup (Python 3.9.5 / Ubuntu), re.match turns out to be significantly faster than re.search, and it outperforms the long chain of in tests. Also, compiling the regex with [0-9] instead of \d gives a small improvement.

import re
from timeit import timeit

n = 10_000_000
s = 'abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyz'

# reference
timeit(lambda: '0' in s or '1' in s or '2' in s or '3' in s or '4' in s or '5' in s or '6' in s or '7' in s or '8' in s or '9' in s, number=n)
# 2.1005349759998353

# re.search with \d, slower
d = re.compile(r'\d')
timeit(lambda: d.search(s), number=n)
# 2.9816031390000717

# re.search with [0-9], better but still slower than reference
d = re.compile('[0-9]')
timeit(lambda: d.search(s), number=n)
# 2.640713582999524

# re.match with [0-9], faster than reference
d = re.compile('[0-9]')
timeit(lambda: d.match(s), number=n)
# 1.5671786130005785

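So, on my machine, using re.match with a compiled [0-9] pattern is about 25% faster than the chained or ... in version, and it looks better too. Note, however, that re.match only tests for a match at the start of the string, so the two calls are not equivalent: match would miss a digit that appears later in the sentence.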
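Before adopting the faster variants, two semantic differences may be worth checking. The sketch below (the sample strings are made up for illustration) shows that match() only tries the pattern at position 0 while search() scans the whole string, and that in a str pattern \d also matches non-ASCII decimal digits unless re.ASCII is set.

```python
import re

digit = re.compile(r'[0-9]')

s = 'abc 42'  # illustrative input: the digit is not at the start

# match() only tries the pattern at position 0, so it fails here.
print(digit.match(s) is not None)    # False
# search() scans the whole string and finds the '4'.
print(digit.search(s) is not None)   # True

# In a str pattern, \d matches any Unicode decimal digit, not only 0-9.
arabic_three = '\u0663'  # ARABIC-INDIC DIGIT THREE
print(re.search(r'\d', arabic_three) is not None)            # True
print(re.search(r'[0-9]', arabic_three) is not None)         # False
print(re.search(r'\d', arabic_three, re.ASCII) is not None)  # False
```

If a digit can appear anywhere in the sentence, search() is the call that answers the original question. The match() timing above measures a check of the first character only.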
