Regular expression and checking occurrences

I am working with regular expressions and am still trying to understand how to properly approach this kind of problem.

So let's say I have this regular expression:

[A − Z]
∗01∗
[ˆ[A − Z]]{3}

On the alphabet [A-Z][0-9].

First question: does {3} mean that there must be at least 3 characters belonging to any "part" of the regular expression (let's say 3 from [A − Z]), or does it refer strictly to the last part ([ˆ[A − Z]])?

My second doubt: if it refers to the last part, checking that there are at least 3 occurrences should be easy (just 3 states that each check whether the character is a digit, otherwise exit), right? Otherwise, if it could apply to any part of the regular expression, how do I check, without a counter, how many occurrences repeat in any possible state? (Please also confirm whether I should avoid using a counter.)

I am not really interested in a solution with code; I just want to fully understand the topic.

Regular expressions are a formal mathematical construction, but the syntaxes for describing them vary. In common syntaxes, {3} means the previous item is repeated three times. For example, [AB]{3} is the same as [AB][AB][AB], so it will match AAA, AAB, ABA, ABB, BAA, BAB, BBA, or BBB. Likewise, (AA|B){2} will match AAAA, AAB, BAA, or BB. It does not require that there be two characters; it requires that there be two matches of (AA|B).
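To see that {3} counts matches of the preceding item rather than characters, one can enumerate short strings and keep those a pattern matches in full. This is a minimal sketch that assumes Python's re module as the "common syntax"; the helper name full_matches is made up for illustration.

```python
import itertools
import re


def full_matches(pattern, alphabet, max_len):
    """Return every string over `alphabet`, up to `max_len` long, that the pattern matches in full."""
    compiled = re.compile(pattern)
    found = []
    for length in range(max_len + 1):
        for chars in itertools.product(alphabet, repeat=length):
            candidate = "".join(chars)
            if compiled.fullmatch(candidate):
                found.append(candidate)
    return found


# Three matches of [AB]: always three characters.
print(full_matches(r"[AB]{3}", "AB", 4))
# ['AAA', 'AAB', 'ABA', 'ABB', 'BAA', 'BAB', 'BBA', 'BBB']

# Two matches of (AA|B): two, three, or four characters.
print(full_matches(r"(AA|B){2}", "AB", 4))
# ['BB', 'AAB', 'BAA', 'AAAA']
```

The second result has strings of three different lengths, which is the point: the quantifier counts repetitions of the group, not characters.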

What the "previous item" is may depend on the particular syntax you are using. For example, in AA|B{2}, either | or {…} could be given higher precedence, so it could mean AA|(B{2}) or (AA|B){2}, depending on the rules of your syntax. However, in the specific example you asked about, the brackets clearly form a unit, so [ˆ[A − Z]]{3} requires three matches of [ˆ[A − Z]]. Again assuming a common syntax, [ˆ[A − Z]] means one character that does not match [A-Z], that is, a character that is not A through Z. Since your alphabet consists only of A through Z and 0 through 9, [^[A-Z]] matches 0 through 9.
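As one concrete data point on precedence: in Python's re module (used here only as an example of a common syntax), a quantifier binds more tightly than |. The sketch below compares the unparenthesized pattern with both explicit groupings; other syntaxes may define this differently, and the variable names are illustrative.

```python
import re

unparenthesized = re.compile(r"AA|B{2}")
quantifier_first = re.compile(r"AA|(?:B{2})")    # AA, or B repeated twice
alternation_first = re.compile(r"(?:AA|B){2}")   # two matches of (AA or B)

SAMPLES = ["AA", "BB", "AAB", "BAA", "AAAA"]


def verdicts(compiled):
    """Full-match result of one compiled pattern against every sample."""
    return [bool(compiled.fullmatch(s)) for s in SAMPLES]


print(verdicts(unparenthesized))    # [True, True, False, False, False]
print(verdicts(quantifier_first))   # [True, True, False, False, False]
print(verdicts(alternation_first))  # [False, True, True, True, True]
```

In this engine the unparenthesized form behaves like AA|(B{2}) on every sample.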

Thus [^[A-Z]]{3} matches a three-digit numeral and nothing else.
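This conclusion can be checked mechanically over the stated alphabet. The sketch below assumes Python's re module and spells the class as [^A-Z], without the inner brackets, because engines differ in how they treat nested brackets; the names ALPHABET and three_non_letters are illustrative.

```python
import itertools
import re
import string

ALPHABET = string.ascii_uppercase + string.digits  # A-Z and 0-9
three_non_letters = re.compile(r"[^A-Z]{3}")

# Single characters of the alphabet accepted by the negated class.
accepted_chars = [c for c in ALPHABET if re.fullmatch(r"[^A-Z]", c)]
print("".join(accepted_chars))  # 0123456789

# Every length-3 string over the alphabet that the pattern matches in full.
matches = [
    "".join(chars)
    for chars in itertools.product(ALPHABET, repeat=3)
    if three_non_letters.fullmatch("".join(chars))
]
print(len(matches))                        # 1000
print(all(m.isdigit() for m in matches))   # True
```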

First, there are a number of problems with your regex.

I believe your "smart" editor has mangled the regex. It has replaced ^ (U+005E CIRCUMFLEX ACCENT) and - (U+002D HYPHEN-MINUS) with fancy versions: ˆ (U+02C6 MODIFIER LETTER CIRCUMFLEX ACCENT) and − (U+2212 MINUS SIGN). They look the same, but they are different characters and have different meanings in a regex. To avoid this, be sure to use a good code editor such as Atom.
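One way to spot such lookalikes is to print the code point and Unicode name of every non-ASCII character in a pattern. This is a small sketch using Python's standard unicodedata module; the function name non_ascii_report and the LOOKALIKES table are hypothetical, and the table covers only the two characters named above.

```python
import unicodedata


def non_ascii_report(text):
    """List (character, code point, Unicode name) for every non-ASCII character."""
    return [
        (ch, f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"))
        for ch in text
        if ord(ch) > 0x7F
    ]


# The pasted pattern, written with escapes so the lookalikes are explicit.
pasted = "[\u02c6[A \u2212 Z]]{3}"
for entry in non_ascii_report(pasted):
    print(entry)

# Hypothetical repair table: map each lookalike back to its ASCII counterpart.
LOOKALIKES = {0x02C6: "^", 0x2212: "-"}
print(pasted.translate(LOOKALIKES))  # [^[A - Z]]{3}
```

Note that the repaired text still contains the spaces, which are a separate problem.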

Spaces are also significant: [A - Z] means something different from [A-Z]. So are newlines; they are treated literally.
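A quick check of both points, assuming Python's re module with no flags (other engines, or a verbose/extended mode, may treat whitespace differently):

```python
import re

spaced = re.compile(r"[A - Z]")
compact = re.compile(r"[A-Z]")


def compare(ch):
    """Whether each character class accepts the single character `ch`."""
    return bool(spaced.fullmatch(ch)), bool(compact.fullmatch(ch))


for ch in ["A", "M", "Z", " "]:
    print(repr(ch), compare(ch))

# A newline inside a pattern must be matched by a newline in the text.
newline_pattern = re.compile("A\nB")
print(bool(newline_pattern.fullmatch("AB")))     # False
print(bool(newline_pattern.fullmatch("A\nB")))   # True
```

Here the spaced class no longer accepts M, while it does accept a space character.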

∗01∗ does not mean "match 01 surrounded by anything". Regexes don't work like file globs. While * does mean "zero or more", as in a file glob, in a regex it is "zero or more of the immediately preceding thing". A dot (.) matches (almost) any character, so the pattern would be .*01.*
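The difference between a star that follows a literal and a star that follows a dot can be seen with a few inputs. This sketch assumes Python's re module; the helper names are made up for illustration.

```python
import re


def compare(text):
    """Full-match results for `01*` and `.*01.*` against the same text."""
    return (
        bool(re.fullmatch(r"01*", text)),      # a 0, then zero or more 1s
        bool(re.fullmatch(r".*01.*", text)),   # 01 with anything around it
    )


for text in ["0", "01", "0111", "x01y"]:
    print(text, compare(text))


def leading_star_is_rejected():
    """A star with nothing before it has nothing to repeat, so compiling fails."""
    try:
        re.compile(r"*01*")
    except re.error:
        return True
    return False


print(leading_star_is_rejected())  # True
```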

[ˆ[A − Z]]{3} should be [^A-Z]{3}. [^...] means to match what is not in the set, so [^A-Z]{3} means to match exactly 3 characters that are not between A and Z, such as 123, abc, or !@#.

Putting it all together: [A-Z].*01.*[^A-Z]{3} says to match exactly one character in the set between A and Z, then anything, then exactly 01, then anything, then exactly 3 characters that are not in the set between A and Z. C01;;; and blah blah Z blah 01 blah blah abc both match.

Regex 101 is a valuable resource for understanding regexes. Regular-Expressions.info is a very good tutorial site.
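The combined pattern can be tried against the two examples, plus two inputs that should fail. This is a minimal sketch assuming Python's re module and a search anywhere in the text rather than a full-string match; the names PATTERN and contains_match are illustrative.

```python
import re

PATTERN = re.compile(r"[A-Z].*01.*[^A-Z]{3}")


def contains_match(text):
    """True if the pattern matches somewhere in `text` (re.search, not fullmatch)."""
    return PATTERN.search(text) is not None


print(contains_match("C01;;;"))                             # True
print(contains_match("blah blah Z blah 01 blah blah abc"))  # True
print(contains_match("c01;;;"))   # False: no character in A-Z before the 01
print(contains_match("C01ABC"))   # False: no three non-A-Z characters after the 01
```

Lowercase letters count as "not between A and Z" here, which is why the trailing abc is accepted.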

First question: does {3} mean that there must be at least 3 characters belonging to any "part" of the regular expression (let's say 3 from [A − Z]), or does it refer strictly to the last part ([ˆ[A − Z]])?

{3} is a "quantifier". So are + (one or more), * (zero or more), and ? (zero or one). Every quantifier applies to the thing immediately preceding it. A{3} means "AAA". [A-Z]{3} means exactly three characters from the set A through Z.
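Each of these quantifiers can be exercised with a full-string match. The sketch assumes Python's re module; the helper name matches_fully is made up for illustration.

```python
import re


def matches_fully(pattern, text):
    """True if `pattern` matches the whole of `text`."""
    return re.fullmatch(pattern, text) is not None


print(matches_fully(r"A{3}", "AAA"))      # True: exactly three A's
print(matches_fully(r"A{3}", "AA"))       # False: one short
print(matches_fully(r"A+", ""))           # False: + needs at least one
print(matches_fully(r"A*", ""))           # True: * allows zero
print(matches_fully(r"A?", "AA"))         # False: ? allows at most one
print(matches_fully(r"[A-Z]{3}", "XYZ"))  # True: three characters from A-Z
print(matches_fully(r"[A-Z]{3}", "XY1"))  # False: 1 is not in A-Z
```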

My second doubt: if it refers to the last part, checking that there are at least 3 occurrences should be easy (just 3 states that each check whether the character is a digit, otherwise exit), right? Otherwise, if it could apply to any part of the regular expression, how do I check, without a counter, how many occurrences repeat in any possible state? (Please also confirm whether I should avoid using a counter.)

Regular expressions are insanely complicated. They are a language unto themselves. Unless this is for a class, use a regular expression library such as PCRE.
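If this is for a class, the idea from the question (a few states that each check for a digit, with no counter variable) can be sketched as a small hand-written state machine for the {3} part alone. This is only an illustration, assuming the alphabet is A-Z and 0-9 so that "not A-Z" means "digit"; the state names and the function name are made up.

```python
DIGITS = "0123456789"

# Each state records how many digits are still required; no counter is kept.
TRANSITIONS = {"need3": "need2", "need2": "need1", "need1": "accept"}


def is_three_digits(text):
    """Accept exactly three characters, each a digit, by walking explicit states."""
    state = "need3"
    for ch in text:
        if ch in DIGITS and state in TRANSITIONS:
            state = TRANSITIONS[state]
        else:
            # Either a non-digit, or more input after the accepting state.
            state = "dead"
            break
    return state == "accept"


print(is_three_digits("123"))   # True
print(is_three_digits("12"))    # False: stops in need1
print(is_three_digits("1234"))  # False: input continues past accept
print(is_three_digits("12A"))   # False: A is not a digit
```

The repetition count is encoded in the number of states, which is how a fixed {n} can be handled without a counter.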
