Regex: How to account for no spaces between numbers

I'm trying to scrape data from a PDF document that contains a lot of financial information. I'm a beginner with regex, but I was able to find the number I was specifically looking for, which is in the hundreds of millions. However, there's no space between the end of that number and the start of the next one, so I'm having a hard time excluding the next number.

This is the result I'm getting:

['183,662,7203.004.00']

The number I want to scrape is 183,662,720, but as you can see, the match also captures the numbers that follow it, since there is no space between them.

The code I'm using is re.findall('\(line 1 minus line 2\)(\d.+?)Less',y). I'll be using this for other versions of this document, where the numbers may range from the tens of thousands to the billions, which complicates things a bit.

Any help would be much appreciated, thanks!

If you want to use the whole pattern, you might use:

\(line 1 minus line 2\)(\d{1,3}(?:,\d{3})*)\d*(?:\.\d+)* Less\b

The pattern matches:

  • \(line 1 minus line 2\) Match the literal text (line 1 minus line 2)
  • ( Open capture group 1
    • \d{1,3}(?:,\d{3})* Match 1 to 3 digits, followed by zero or more repetitions of a comma and exactly 3 digits
  • ) Close group 1
  • \d*(?:\.\d+)* Match optional digits, followed by zero or more repetitions of a . and 1 or more digits
  • Less\b Match a space and Less, followed by a word boundary to prevent a partial match
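
The same breakdown can be written into the pattern itself with re.VERBOSE. This is only a sketch: the name PATTERN is chosen for illustration, and the sample text is built from the result shown in the question. In verbose mode unescaped whitespace is ignored, so each literal space has to be written as "\ ".

```python
import re

# Same pattern as above, spread over several lines with re.VERBOSE.
# Unescaped whitespace is ignored in this mode, so literal spaces are escaped.
PATTERN = re.compile(
    r"""
    \(line\ 1\ minus\ line\ 2\)   # the literal label, parentheses included
    (                             # capture group 1: the number we want
        \d{1,3}                   # 1 to 3 leading digits
        (?:,\d{3})*               # zero or more thousands groups: comma + 3 digits
    )
    \d*                           # leading digits of the adjacent number (not captured)
    (?:\.\d+)*                    # zero or more parts made of a dot and digits
    \ Less\b                      # a space, then Less ending at a word boundary
    """,
    re.VERBOSE,
)

text = "This is (line 1 minus line 2)183,662,7203.004.00 Less test"
print(PATTERN.findall(text))  # ['183,662,720']
```

The split works because a comma is always followed by exactly three digits, so anything after the third digit of the last group has to belong to the next number.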


For example:

import re

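# Sample text: the wanted number is followed directly by other numbers, with no space
y = r"This is (line 1 minus line 2)183,662,7203.004.00 Less test"
# Group 1 keeps only the comma-grouped number; the trailing digits are matched but not captured
print(re.findall(r"\(line 1 minus line 2\)(\d{1,3}(?:,\d{3})*)\d*(?:\.\d+)* Less\b", y))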

Output:

['183,662,720']
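
Since the question mentions other versions of the document with values from the tens of thousands to the billions, here is a rough sketch of how the same pattern could be checked against those magnitudes and the result converted to an integer. The helper name extract_amount and the sample strings are made up for illustration; only the first sample corresponds to the text in the question.

```python
import re

PATTERN = re.compile(
    r"\(line 1 minus line 2\)(\d{1,3}(?:,\d{3})*)\d*(?:\.\d+)* Less\b"
)

def extract_amount(text):
    """Return the first captured amount as an int, or None if nothing matches.

    Hypothetical helper, not part of the original answer.
    """
    match = PATTERN.search(text)
    if match is None:
        return None
    # Drop the thousands separators before converting.
    return int(match.group(1).replace(",", ""))

# Made-up inputs covering different magnitudes.
samples = [
    "(line 1 minus line 2)183,662,7203.004.00 Less",    # hundreds of millions
    "(line 1 minus line 2)45,1203.004.00 Less",         # tens of thousands
    "(line 1 minus line 2)2,183,662,7203.004.00 Less",  # billions
]

for sample in samples:
    print(extract_amount(sample))
# 183662720
# 45120
# 2183662720
```

Note that this relies on the comma grouping. A value without thousands separators (for instance anything below 1,000) that runs directly into the next number cannot be separated reliably this way, because the pattern has no way to tell where one number ends and the next begins.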
