Extracting integers with a regular expression

Question

I need help extracting numbers from a column that stores text. The text can also contain prices, which I don't want to extract. For example, given the following text:

text = "I have the following products 4526 and 4. The first one I paid $40 while the second one 30€. 
Here the link for the discount of 3.99: https://www.xysyffd.coom/7574@5757"

My expected result would be:

[4526, 4]

Right now I am using the following regular expression:

'(?<![\d.])[0-9]+(?![\d.])'

It discards the 3.99, but it still matches the prices and the numbers in the link. Any suggestions on how to update the regex?
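To make the problem concrete, here is a minimal sketch of what the current pattern returns on the sample text. The variable names are illustrative only. Note that, besides matching the prices and the link numbers, it also misses the standalone 4, because the sentence-ending period right after it trips the `(?![\d.])` lookahead.

```python
import re

# The pattern from the question: digits not adjacent to another digit or a dot.
original_pattern = r"(?<![\d.])[0-9]+(?![\d.])"

text = (
    "I have the following products 4526 and 4. The first one I paid $40 "
    "while the second one 30€. \n"
    "Here the link for the discount of 3.99: https://www.xysyffd.coom/7574@5757"
)

original_matches = re.findall(original_pattern, text)
print(original_matches)  # ['4526', '40', '30', '7574', '5757']
```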

Answer 1

You can assert a whitespace boundary to the left, and exclude matches that are followed by the euro sign or by a dot and a digit.

(?<!\S)\d+\b(?!€|\.\d)

(?<!\S) Assert that there is no non-whitespace char to the left (a whitespace boundary)
\d+ Match 1+ digits
\b A word boundary to prevent a partial match
(?!€|\.\d) Negative lookahead to assert that what is directly to the right is not €, or a . followed by a digit
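The breakdown above can also be kept next to the pattern itself. The following is a sketch that writes the same pattern with `re.VERBOSE`, which ignores whitespace in the pattern and allows `#` comments; the name `verbose_pattern` is just an illustrative choice.

```python
import re

# Same pattern as above, split into commented parts using re.VERBOSE.
verbose_pattern = re.compile(
    r"""
    (?<!\S)        # whitespace boundary: no non-whitespace char to the left
    \d+            # one or more digits
    \b             # word boundary, prevents a partial match
    (?!€|\.\d)     # not followed by a euro sign, nor by a dot and a digit
    """,
    re.VERBOSE,
)

text = (
    "I have the following products 4526 and 4. The first one I paid $40 "
    "while the second one 30€. \n"
    "Here the link for the discount of 3.99: https://www.xysyffd.coom/7574@5757"
)

verbose_matches = verbose_pattern.findall(text)
print(verbose_matches)  # ['4526', '4']
```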

Regex demo | Python demo

Example

import re
 
pattern = r"(?<!\S)\d+\b(?!€|\.\d)"
s = ("I have the following products 4526 and 4. The first one I paid $40 while the second one 30€. \n"
    "Here the link for the discount of 3.99: https://w...content-available-to-author-only...d.coom/7574@5757\n")
 
print(re.findall(pattern, s))

Output

['4526', '4']
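Note that `re.findall` returns strings, while the expected result in the question is a list of integers, and the data lives in a column of texts. A minimal sketch of how this could be applied per row follows. It assumes the column can be treated as a plain list of strings; `extract_integers` and `rows` are hypothetical names, not part of the original question.

```python
import re

# Compile once so the pattern can be reused for every row.
INTEGER_PATTERN = re.compile(r"(?<!\S)\d+\b(?!€|\.\d)")

def extract_integers(text):
    """Return the standalone integers in text as a list of int."""
    return [int(match) for match in INTEGER_PATTERN.findall(text)]

# Hypothetical stand-in for the text column.
rows = [
    "I have the following products 4526 and 4. The first one I paid $40 while the second one 30€.",
    "Here the link for the discount of 3.99: https://www.xysyffd.coom/7574@5757",
    "No numbers here",
]

extracted = [extract_integers(row) for row in rows]
print(extracted)  # [[4526, 4], [], []]
```

With a pandas column, the same function could presumably be passed to `Series.apply`, but that is not shown here.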

Answer 2

Use

(?<!\S)[0-9]+(?!\.\d|[^\s!?.])

See proof.

EXPLANATION

--------------------------------------------------------------------------------
  (?<!                     look behind to see if there is not:
--------------------------------------------------------------------------------
    \S                       non-whitespace (all but \n, \r, \t, \f,
                             and " ")
--------------------------------------------------------------------------------
  )                        end of look-behind
--------------------------------------------------------------------------------
  [0-9]+                   any character of: '0' to '9' (1 or more
                           times (matching the most amount possible))
--------------------------------------------------------------------------------
  (?!                      look ahead to see if there is not:
--------------------------------------------------------------------------------
    \.                       '.'
--------------------------------------------------------------------------------
    \d                       digits (0-9)
--------------------------------------------------------------------------------
   |                        OR
--------------------------------------------------------------------------------
    [^\s!?.]                 any character except: whitespace (\n,
                             \r, \t, \f, and " "), '!', '?', '.'
--------------------------------------------------------------------------------
  )                        end of look-ahead

Python code:

import re
regex = r"(?<!\S)[0-9]+(?!\.\d|[^\s!?.])"
test_str = "I have the following products 4526 and 4. The first one I paid $40 while the second one 30€. \nHere the link for the discount of 3.99: https://www.xysyffd.coom/7574@5757"
matches = re.findall(regex, test_str)
print(matches)

Results: ['4526', '4']
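Both answers give the same result on the sample text, but they are not equivalent. The pattern from Answer 1 only rejects a trailing € or a decimal part, while the pattern from Answer 2 rejects any trailing character other than whitespace, '!', '?' or '.'. The sketch below compares them on a made-up string (the `sample` text is an assumption for illustration, not from the question).

```python
import re

pattern_answer_1 = r"(?<!\S)\d+\b(?!€|\.\d)"
pattern_answer_2 = r"(?<!\S)[0-9]+(?!\.\d|[^\s!?.])"

# Hypothetical input with a comma, an exclamation mark, a trailing dollar sign and a unit.
sample = "Items 12, 7! and 30$ or 5kg"

matches_1 = re.findall(pattern_answer_1, sample)
matches_2 = re.findall(pattern_answer_2, sample)

print(matches_1)  # ['12', '7', '30']
print(matches_2)  # ['7']
```

Which behaviour is preferable depends on the data: Answer 1 keeps numbers followed by a comma but also lets through prices with a trailing symbol other than €, whereas Answer 2 is stricter and drops both.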

2 answers

Answer 1
2 2021-05-14 18:36:49

Answer 2
1 Accepted 2021-05-14 18:40:55
