
Need to find a Python regex to get only the last digits

I have a huge PDF of invoices, each page containing very basic text. I need to create a regex or two so that, when I split the PDF, I get the customer number and the invoice number to use in the file name. I am currently using Python 3 and PyPDF2.

Text example of 2 of the pages:

Detailed Invoice Report
Starting 8/12/2015 and ending 8/11/2022
Company:  (Multiple Companies) Printed by Robert S on 8/11/2022   1:26:46PM
Donna Contact Cust# Name: Customer A  1234
Customer A Invoice Date Invoice Name 8/12/2015  241849
Item Description Qty Price Extended Price
Credit ($810.00)  1 ($810.00) 1
Due Paid Total Total Taxes Subtotal
($810.00) ($810.00) $0.00 ($810.00)
Balance: ($810.00) $0.00 $0.00 
8/11/2022   1:26:46PM Page 1 of 340977

Detailed Invoice Report
Starting 8/12/2015 and ending 8/11/2022
Company:  (Multiple Companies) Printed by Robert S on 8/11/2022   1:26:46PM
Customer B Cust# Name: Customer B  45678
Customer B Invoice Date Invoice Name 8/12/2015  241850
Item Description Qty Price Extended Price
credit ($49.99)  1 ($49.99) 1
Due Paid Total Total Taxes Subtotal
($49.99) ($49.99) $0.00 ($49.99)
Balance: ($49.99) $0.00 $0.00 
8/11/2022   1:26:46PM Page 2 of 340977

Currently I have these 2 regexes, which more or less find each value, but I do not know how to keep only the last group's match from them. Note: the firstmatch regex breaks if the customer name has a number in it, which is an edge case but not uncommon in the data.

firstmatch=r"(Name:)(\D*)(\d+)"
secondmatch=r"(Name )(\d*.\d*.\d*..)(\d*)"

Each one is its own page, and I would like the regex to pull 1234 and 241849 from the first one and 45678 and 241850 from the second.

You could get both values using a capture group that matches the last digits on the line.

For the first pattern:

\bName:.*?\b(\d+)[^\d\n]*$

Explanation

  • \bName: Match Name: preceded by a word boundary
  • .*? Match any character except a newline, as few times as possible
  • \b(\d+) A word boundary, then capture 1+ digits in group 1
  • [^\d\n]* Optionally match any characters except digits or a newline
  • $ End of the string (or end of each line when re.MULTILINE is enabled, which is needed here because the target line is in the middle of the page text)
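
As a rough illustration of the explanation above, the pattern can be applied to a page's text with re.search. This is only a sketch: the sample text is shortened from the question, the names page_text, customer_re and tricky_line are made up for the example, and the "Route 66 Diner" line is a hypothetical customer name containing digits.

```python
import re

# Sample text shortened from the first page in the question.
page_text = """Detailed Invoice Report
Starting 8/12/2015 and ending 8/11/2022
Donna Contact Cust# Name: Customer A  1234
Customer A Invoice Date Invoice Name 8/12/2015  241849
Item Description Qty Price Extended Price
"""

# re.MULTILINE lets $ match at the end of every line, not only at the end of the text.
customer_re = re.compile(r"\bName:.*?\b(\d+)[^\d\n]*$", re.MULTILINE)

match = customer_re.search(page_text)
customer_number = match.group(1) if match else None
print(customer_number)  # 1234

# Hypothetical edge case: the customer name itself contains a number.
tricky_line = "Cust# Name: Route 66 Diner  1234"
print(customer_re.search(tricky_line).group(1))  # 1234
```

Because [^\d\n]*$ allows no digits after the captured group, only the last run of digits on the line can be captured, which is what handles names that contain numbers.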


For the second pattern you can make it a bit more specific, where [^\S\n]+ matches 1+ whitespace characters other than newlines:

\bName[^\S\n]+\d+/\d+/\d+[^\S\n]+(\d+)[^\d\n]*$
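
A minimal sketch of using this second pattern in Python, again assuming re.MULTILINE and a shortened copy of the second page from the question (page_text and invoice_re are names chosen for the example):

```python
import re

# Sample text shortened from the second page in the question.
page_text = (
    "Customer B Cust# Name: Customer B  45678\n"
    "Customer B Invoice Date Invoice Name 8/12/2015  241850\n"
    "Item Description Qty Price Extended Price\n"
)

# "Name" must be followed by whitespace and a date, so the "Name:" on the
# customer line is skipped and only the invoice line can match.
invoice_re = re.compile(
    r"\bName[^\S\n]+\d+/\d+/\d+[^\S\n]+(\d+)[^\d\n]*$",
    re.MULTILINE,
)

match = invoice_re.search(page_text)
invoice_number = match.group(1) if match else None
print(invoice_number)  # 241850
```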


Or, if the two lines directly follow each other, you can use a single pattern with 2 capture groups that also matches the newline at the end of the first line:

\bName:.*?\b(\d+)[^\d\n]*\n\b.*?Name[^\S\n]+\d+/\d+/\d+[^\S\n]+(\d+)[^\d\n]*$
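
One way the combined pattern might be turned into file names is sketched below. The helper invoice_filename and the customer_invoice.pdf naming scheme are assumptions made for this example, not something from the question; in practice the page text would come from whatever PDF library is doing the splitting.

```python
import re

# The combined pattern, split over two string literals for readability.
COMBINED_RE = re.compile(
    r"\bName:.*?\b(\d+)[^\d\n]*\n"
    r"\b.*?Name[^\S\n]+\d+/\d+/\d+[^\S\n]+(\d+)[^\d\n]*$",
    re.MULTILINE,
)

def invoice_filename(page_text):
    """Return '<customer>_<invoice>.pdf' for a page's text, or None if nothing matches."""
    match = COMBINED_RE.search(page_text)
    if match is None:
        return None
    customer_number, invoice_number = match.groups()
    return f"{customer_number}_{invoice_number}.pdf"

# Text shortened from the two sample pages in the question.
pages = [
    "Donna Contact Cust# Name: Customer A  1234\n"
    "Customer A Invoice Date Invoice Name 8/12/2015  241849\n"
    "Item Description Qty Price Extended Price\n",
    "Customer B Cust# Name: Customer B  45678\n"
    "Customer B Invoice Date Invoice Name 8/12/2015  241850\n"
    "Item Description Qty Price Extended Price\n",
]

filenames = [invoice_filename(text) for text in pages]
print(filenames)  # ['1234_241849.pdf', '45678_241850.pdf']
```

Returning None for pages that do not match keeps a single unexpected page from stopping a long run; the caller can log or skip those pages.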

