Need to find a Python regex to get the last digits

Question

I have a huge PDF of invoices, each page containing very basic text. I need to create a regex or two so that when I split the PDF I can get the customer number and the invoice number to use in the file name. I am currently using Python 3 and PyPDF2.

Text example of 2 of the pages:

Detailed Invoice Report
Starting 8/12/2015 and ending 8/11/2022
Company:  (Multiple Companies) Printed by Robert S on 8/11/2022   1:26:46PM
Donna Contact Cust# Name: Customer A  1234
Customer A Invoice Date Invoice Name 8/12/2015  241849
Item Description Qty Price Extended Price
Credit ($810.00)  1 ($810.00) 1
Due Paid Total Total Taxes Subtotal
($810.00) ($810.00) $0.00 ($810.00)
Balance: ($810.00) $0.00 $0.00 
8/11/2022   1:26:46PM Page 1 of 340977

Detailed Invoice Report
Starting 8/12/2015 and ending 8/11/2022
Company:  (Multiple Companies) Printed by Robert S on 8/11/2022   1:26:46PM
Customer B Cust# Name: Customer B  45678
Customer B Invoice Date Invoice Name 8/12/2015  241850
Item Description Qty Price Extended Price
credit ($49.99)  1 ($49.99) 1
Due Paid Total Total Taxes Subtotal
($49.99) ($49.99) $0.00 ($49.99)
Balance: ($49.99) $0.00 $0.00 
8/11/2022   1:26:46PM Page 2 of 340977

Currently I have these 2 regexes that roughly match each value, but I do not know how to keep only the last group's match from them. Note: the firstmatch regex breaks if the customer name has a number in it, which is an edge case but not uncommon in the data.

firstmatch=r"(Name:)(\D*)(\d+)"
secondmatch=r"(Name )(\d*.\d*.\d*..)(\d*)"

Each one is its own page, and I would like the regex to pull 1234 and 241849 from the first page, and 45678 and 241850 from the second.

Answer 1

You could get both values using a capture group that matches the last digits on the line.

For the first pattern:

\bName:.*?\b(\d+)[^\d\n]*$

Explanation

\bName: Match Name: preceded by a word boundary
.*? Match any character except a newline, as few times as possible
\b(\d+) A word boundary, then capture 1+ digits in group 1
[^\d\n]* Optionally match any characters except digits or a newline
$ End of the line (in Python this needs the re.MULTILINE flag; without it, $ only matches at the end of the whole string)

Regex demo
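As a rough sketch of how the first pattern could be used in Python, assuming the page text has already been extracted into a string (for example with PyPDF2's extract_text()). The helper name customer_number and the "Customer 24 Seven" line are made up for illustration; they are not from the question's data.

```python
import re

# re.MULTILINE lets $ match at the end of each line,
# not only at the end of the whole page text.
CUSTOMER_RE = re.compile(r"\bName:.*?\b(\d+)[^\d\n]*$", re.MULTILINE)


def customer_number(page_text):
    """Return the last run of digits on the 'Name:' line, or None if absent."""
    match = CUSTOMER_RE.search(page_text)
    return match.group(1) if match else None


page_1 = (
    "Donna Contact Cust# Name: Customer A  1234\n"
    "Customer A Invoice Date Invoice Name 8/12/2015  241849\n"
)
# Hypothetical line with digits inside the customer name
# (the edge case mentioned in the question).
tricky = "Cust# Name: Customer 24 Seven  1234\nnext line\n"

print(customer_number(page_1))  # 1234
print(customer_number(tricky))  # 1234
```

Because the digits must be followed only by non-digits up to the end of the line, numbers inside the customer name are skipped.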

For the second pattern you can make it a bit more specific, where [^\S\n]+ matches 1+ whitespace chars without newlines:

\bName[^\S\n]+\d+/\d+/\d+[^\S\n]+(\d+)[^\d\n]*$

Regex demo
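A minimal sketch of the second pattern in Python, again assuming the page text is already a plain string. The helper name invoice_number is hypothetical.

```python
import re

# "Name", horizontal whitespace, a d/m/y-style date, whitespace,
# then the invoice number as the last digits on the line.
INVOICE_RE = re.compile(
    r"\bName[^\S\n]+\d+/\d+/\d+[^\S\n]+(\d+)[^\d\n]*$",
    re.MULTILINE,
)


def invoice_number(page_text):
    """Return the digits after the date on the 'Invoice Name' line, or None."""
    match = INVOICE_RE.search(page_text)
    return match.group(1) if match else None


page_1 = (
    "Donna Contact Cust# Name: Customer A  1234\n"
    "Customer A Invoice Date Invoice Name 8/12/2015  241849\n"
)
page_2 = (
    "Customer B Cust# Name: Customer B  45678\n"
    "Customer B Invoice Date Invoice Name 8/12/2015  241850\n"
)

print(invoice_number(page_1))  # 241849
print(invoice_number(page_2))  # 241850
```

The "Name:" on the customer line is not picked up here, because the pattern requires whitespace directly after Name rather than a colon.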

Or, if the lines directly follow each other, you can also use 1 pattern with 2 capture groups and match the newline at the end of the first line:

\bName:.*?\b(\d+)[^\d\n]*\n\b.*?Name[^\S\n]+\d+/\d+/\d+[^\S\n]+(\d+)[^\d\n]*$

Regex demo
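The single-pattern variant could be turned into a file name along these lines. This is only a sketch: the helper invoice_filename and the "customer_invoice.pdf" naming scheme are assumptions, and the page strings are shortened copies of the sample pages from the question.

```python
import re

# One pattern, two groups: customer number on the "Name:" line,
# invoice number on the line immediately after it.
BOTH_RE = re.compile(
    r"\bName:.*?\b(\d+)[^\d\n]*\n"
    r"\b.*?Name[^\S\n]+\d+/\d+/\d+[^\S\n]+(\d+)[^\d\n]*$",
    re.MULTILINE,
)


def invoice_filename(page_text):
    """Build '<customer>_<invoice>.pdf' (hypothetical scheme), or None if no match."""
    match = BOTH_RE.search(page_text)
    if match is None:
        return None
    customer, invoice = match.groups()
    return f"{customer}_{invoice}.pdf"


page_1 = (
    "Detailed Invoice Report\n"
    "Starting 8/12/2015 and ending 8/11/2022\n"
    "Donna Contact Cust# Name: Customer A  1234\n"
    "Customer A Invoice Date Invoice Name 8/12/2015  241849\n"
    "Item Description Qty Price Extended Price\n"
)
page_2 = (
    "Detailed Invoice Report\n"
    "Starting 8/12/2015 and ending 8/11/2022\n"
    "Customer B Cust# Name: Customer B  45678\n"
    "Customer B Invoice Date Invoice Name 8/12/2015  241850\n"
    "Item Description Qty Price Extended Price\n"
)

print(invoice_filename(page_1))  # 1234_241849.pdf
print(invoice_filename(page_2))  # 45678_241850.pdf
```

Returning None on a failed match makes it easy to flag pages whose layout differs, instead of writing a file with a wrong name.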
