RegEx pattern to capture invoice line items containing unit prices in description

Using C#, I am attempting to extract individual invoice line items from a block of text containing ALL the line items. For each line item, I want to separate and capture the Line Item Code, Line Item Description, and Line Item Dollar Amount. The issue is that many of the line item descriptions include decimal amounts similar to dollar amounts, so the regex I am using is capturing several entire line items into one line item description. How can I alter my regex pattern to include these decimal numbers in the description, while still separating prices into another match group? I am also open to other optimization suggestions.

Here is the block of line items that is giving me trouble:

1244 Drayage Charge MEDU2265085
1,875.00
4083 Chassis MEDU2265085 TRIAXLE 4 DAYS
640.00
1268 Pre-Pull MEDU2265085
250.00
1248 Truck Waiting & Over Time MEDU2265085 3.5*120
420.00
1244 Drayage Charge MEDU3325790
1,875.00
4083 Chassis MEDU3325790 TRIAXLE 4 DAYS
640.00
1268 Pre-Pull MEDU3325790
250.00
1248 Truck Waiting & Over Time MEDU3325790 2.38*120
285.60
1244 Drayage Charge MSCU3870551
1,875.00
4083 Chassis MSCU3870551 TRIAXLE 4 DAYS
640.00
1268 Pre-Pull MSCU3870551
250.00
1248 Truck Waiting & Over Time MSCU3870551 3.5*120
420.00

And here is my best attempt at a regex pattern:

(?<LINE_ITEM_CODE>[0-9]{4})[\r\s\n](?<LINE_ITEM_DESCRIPTION>[A-Za-z0-9\r\s\n\-\%\&\*\.]*)[\r\n\s](?<LINE_ITEM_AMOUNT>[0-9\,]{1,7}.[0-9]{2})

If you punch these in over at regexr.com or regexstorm.net, you'll see that several of the line items are being captured as a single line item description. The alternative I had been using previously did not accommodate the 3.5, 2.38, etc. How can I target the prices while still grouping the other decimals into the description?

I'm open to alternative solutions.

You can use

(?m)^(?<LINE_ITEM_CODE>\d{4})\s+(?<LINE_ITEM_DESCRIPTION>.*?)\r?\n(?<LINE_ITEM_AMOUNT>\d{1,3}(?:,\d{3})*\.\d{2})

See the regex demo.

Details:

  • (?m)^ - an inline multiline flag that makes ^ match at the start of each line
  • (?<LINE_ITEM_CODE>\d{4}) - Group "LINE_ITEM_CODE": four digits
  • \s+ - one or more whitespace characters (including newlines)
  • (?<LINE_ITEM_DESCRIPTION>.*?) - Group "LINE_ITEM_DESCRIPTION": zero or more characters other than a line feed, as few as possible
  • \r?\n - a CRLF or LF line ending
  • (?<LINE_ITEM_AMOUNT>\d{1,3}(?:,\d{3})*\.\d{2}) - Group "LINE_ITEM_AMOUNT": one to three digits, then zero or more repetitions of a comma and three digits, then a dot and two digits.
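For illustration, here is a minimal sketch of how the pattern above might be applied in C#. The class name InvoiceLineParser, the Parse method, and the tuple shape are hypothetical names chosen for this example, and the sketch assumes amounts use a comma as the thousands separator and a dot as the decimal separator, as in the sample data.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration; not part of the original answer.
public static class InvoiceLineParser
{
    // The pattern from the answer, unchanged.
    private static readonly Regex LineItem = new Regex(
        @"(?m)^(?<LINE_ITEM_CODE>\d{4})\s+(?<LINE_ITEM_DESCRIPTION>.*?)\r?\n(?<LINE_ITEM_AMOUNT>\d{1,3}(?:,\d{3})*\.\d{2})");

    public static List<(string Code, string Description, decimal Amount)> Parse(string text)
    {
        var items = new List<(string Code, string Description, decimal Amount)>();
        foreach (Match m in LineItem.Matches(text))
        {
            // Assumption: "1,875.00" style amounts, parsed culture-independently.
            decimal amount = decimal.Parse(
                m.Groups["LINE_ITEM_AMOUNT"].Value,
                NumberStyles.Number,
                CultureInfo.InvariantCulture);

            items.Add((
                m.Groups["LINE_ITEM_CODE"].Value,
                m.Groups["LINE_ITEM_DESCRIPTION"].Value,
                amount));
        }
        return items;
    }

    public static void Main()
    {
        // First four line items from the question's sample block.
        string text = string.Join("\n", new[]
        {
            "1244 Drayage Charge MEDU2265085",
            "1,875.00",
            "4083 Chassis MEDU2265085 TRIAXLE 4 DAYS",
            "640.00",
            "1268 Pre-Pull MEDU2265085",
            "250.00",
            "1248 Truck Waiting & Over Time MEDU2265085 3.5*120",
            "420.00"
        });

        foreach (var item in Parse(text))
        {
            Console.WriteLine($"{item.Code} | {item.Description} | {item.Amount}");
        }
    }
}
```

Because . does not match a line feed, the description can never run past the end of its own line, so values such as 3.5*120 stay in the description, while the amount has to appear at the start of the following line.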
