How to write regex to capture specific number formats and exclude the rest?

I am trying to capture a limited set of valid cases from a string that also contains many invalid number formats, using Python regex. The valid cases are numbers with comma thousands separators, optionally followed by a decimal part. Everything else is invalid. A sample is below.

Sample input string:

input = "The net value is 1,000,000.00 however even 100,000 or 1,000,000 is acceptable. The amounts that are not acceptable are 1 000,000.00 or 1.000.000.00 or 1,000,000,00 or 1,000,000,0000"

Expected output: 1,000,000.00 100,000 1,000,000

The current Python regex I tried is as follows:

\d{1,3}(,{1}\d{3})*(\.{1}\d+){0,1}$

This only works when the input consists of a single number. When the number is surrounded by words, it fails.
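A minimal sketch of that failure, assuming the pattern is applied with `re.search` (the question does not say which function was used). The trailing `$` only allows a match that ends at the end of the string, so a number followed by more text is never found:

```python
import re

# The pattern from the question, unchanged.
pattern = r"\d{1,3}(,{1}\d{3})*(\.{1}\d+){0,1}$"

# The number is the whole string, so the match can end at `$`.
only_number = re.search(pattern, "1,000,000.00")

# The number is followed by a word, so no match can end at `$`.
in_sentence = re.search(pattern, "The net value is 1,000,000.00 however")

print(only_number.group(0) if only_number else None)
print(in_sentence)
```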

The following regex pattern gets closer to what you want here:

(?<!\S)[1-9]\d{0,2}(?:,\d{3})*(?:\.\d+)?(?!\S)

This uses lookarounds to assert that each number is bounded on both sides by whitespace or by the start/end of the input. Also note that we insist that a valid number must not start with zero.
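As a rough illustration of those two rules, the pattern can be probed with a few short strings. The sample strings and the `check` helper below are made up for this sketch and are not part of the original answer:

```python
import re

pattern = r"(?<!\S)[1-9]\d{0,2}(?:,\d{3})*(?:\.\d+)?(?!\S)"

def check(text):
    """Hypothetical helper: return every match of the pattern in text."""
    return re.findall(pattern, text)

valid = check("cost: 12,345.67 total")  # bounded by whitespace on both sides
leading_zero = check("0,100")           # rejected: starts with zero
glued_to_word = check("x1,000")         # rejected: no whitespace on the left
short_group = check("1,000,00")         # rejected: last group has 2 digits

print(valid, leading_zero, glued_to_word, short_group)
```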

I would use re.findall as follows:

import re

inp = "The net value is 1,000,000.00 however even 100,000 or 1,000,000 is acceptable. The amounts that are not acceptable are 1 000,000.00 or 1.000.000.00 or 1,000,000,00 or 1,000,000,0000"
matches = re.findall(r'(?<!\S)[1-9]\d{0,2}(?:,\d{3})*(?:\.\d+)?(?!\S)', inp)
print(matches)

This prints:

['1,000,000.00', '100,000', '1,000,000', '1']

As a note on why 1 appears in the result above: there is no obvious way to know that the standalone number 1 is actually part of the broken one-million number.

Another option is to use a negative lookahead to rule out numbers that have only zeroes before the first comma, and to require at least one comma group after the leading digits, since your desired output is 1,000,000.00 100,000 1,000,000:

(?<!\S)(?!0+\,)\d{1,3}(?:,\d{3})+(?:\.\d+)?(?!\S)

Explanation

  • (?<!\S) Assert a whitespace boundary to the left
  • (?!0+\,) Assert that there are not only zeroes before the first comma
  • \d{1,3} Match 1-3 digits
  • (?:,\d{3})+ Repeat 1+ times matching a comma and exactly 3 digits
  • (?:\.\d+)? Optionally match a dot and 1+ digits
  • (?!\S) Assert a whitespace boundary to the right
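The same breakdown can be kept next to the pattern itself by compiling it with `re.VERBOSE`. This is only a sketch: the name `NUMBER` and the sample strings are illustrative, and `fullmatch` is used here to test one candidate at a time rather than to scan a sentence.

```python
import re

# Same pattern as above, written in verbose mode so each part is commented.
NUMBER = re.compile(r"""
    (?<!\S)        # left boundary: start of string or whitespace
    (?!0+,)        # the digits before the first comma must not be all zeroes
    \d{1,3}        # 1-3 leading digits
    (?:,\d{3})+    # one or more groups of a comma and exactly 3 digits
    (?:\.\d+)?     # optional decimal part
    (?!\S)         # right boundary: whitespace or end of string
""", re.VERBOSE)

samples = ["1,000,000.00", "100,000", "000,000.00", "1,000,000,00", "1000"]
results = {s: bool(NUMBER.fullmatch(s)) for s in samples}
print(results)
```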


Example

import re
 
input = "The net value is 1,000,000.00 however even 100,000 or 1,000,000 is acceptable. The amounts that are not acceptable are 1 000,000.00 or 1.000.000.00 or 1,000,000,00 or 1,000,000,0000"
regex = r"(?<!\S)(?!0+\,)\d{1,3}(?:,\d{3})+(?:\.\d+)?(?!\S)"
 
print(re.findall(regex, input))

Output

['1,000,000.00', '100,000', '1,000,000']
