
Partial Matching with a regular expression

I'm building a system that recognizes pattern-based words such as license plates and fiscal codes. My input strings are sometimes imperfect because they come from an OCR system, so I'm looking for a way to "relax" my regular expression.

Example regex:

^(([a-zA-Z]{2}\d{3}[a-zA-Z]{2})|(([a-zA-Z]{2}|roma)(\d{5}|\d{6})))$

If I have an (Italian) license plate AB123CD and match it against my regex, it works. But if the input is slightly corrupted, e.g. ABi23CD (the OCR system reads the '1' as an 'i'), the regex obviously doesn't match.

Is there any way to allow some errors in regex matching? In the example, the second string would match the regex if the i were substituted with a digit, so it should match with one error allowed. Thanks!

OCR errors are hard to predict, but if you analyze the output you have, you may find regularities, such as i being recognized instead of 1.

In that case, you can use character classes: \d matches a digit, and [i\d] matches i or a digit (in your case, i stands for a digit too, so you can replace i with 1 later).

So your pattern will look like this:

^([a-zA-Z]{2}[\di]{3}[a-zA-Z]{2}|([a-zA-Z]{2}|roma)[\di]{5,6})$
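
As a quick check, here is a minimal Python sketch that compares the original pattern with the relaxed one on the two plates from the question. The names STRICT, RELAXED and matches are made up for this illustration, and Python's re module is only an assumption, since the question does not say which regex engine is in use.

```python
import re

# Original pattern from the question (digits only in the numeric positions).
STRICT = re.compile(
    r"^(([a-zA-Z]{2}\d{3}[a-zA-Z]{2})|(([a-zA-Z]{2}|roma)(\d{5}|\d{6})))$"
)

# Relaxed pattern: "i" is also accepted wherever a digit is expected.
RELAXED = re.compile(
    r"^([a-zA-Z]{2}[\di]{3}[a-zA-Z]{2}|([a-zA-Z]{2}|roma)[\di]{5,6})$"
)


def matches(pattern, text):
    """Return True if the whole string matches the compiled pattern."""
    return pattern.match(text) is not None


for plate in ("AB123CD", "ABi23CD"):
    print(plate, matches(STRICT, plate), matches(RELAXED, plate))
```

The clean plate passes both patterns, while the corrupted one passes only the relaxed pattern.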

Note that (\d{5}|\d{6}) can be shortened to \d{5,6}.

You can add more characters to [\di] as you find them.
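
One way to keep that character class easy to extend is to build it from a table of known misreads and use the same table to repair the digit positions after a match. The sketch below assumes Python's re module. Apart from i read for 1, the CONFUSIONS entries are hypothetical examples rather than observed OCR errors, and the names DIGIT, PLATE and normalize are invented for this illustration.

```python
import re

# Misread character -> intended digit. Only "i" comes from the question;
# "l" and "O" are hypothetical entries added to show how the table grows.
CONFUSIONS = {"i": "1", "l": "1", "O": "0"}

# Character class accepting a real digit or any known misread.
DIGIT = "[" + r"\d" + "".join(re.escape(c) for c in CONFUSIONS) + "]"

# Same two plate shapes as above, with groups around each part so the
# numeric part can be repaired separately from the letters.
PLATE = re.compile(
    rf"^(?:([a-zA-Z]{{2}})({DIGIT}{{3}})([a-zA-Z]{{2}})"
    rf"|([a-zA-Z]{{2}}|roma)({DIGIT}{{5,6}}))$"
)

_FIX = str.maketrans(CONFUSIONS)


def normalize(text):
    """Return the plate with misread digits repaired, or None if no match."""
    m = PLATE.match(text)
    if m is None:
        return None
    if m.group(1) is not None:
        # First shape: two letters, three digits, two letters.
        return m.group(1) + m.group(2).translate(_FIX) + m.group(3)
    # Second shape: two letters (or "roma") followed by five or six digits.
    return m.group(4) + m.group(5).translate(_FIX)


print(normalize("ABi23CD"))
```

Only the numeric groups are translated, so an i or O in a letter position is left untouched.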
