Regular expression and checking occurrences
I am working with regular expressions and am still trying to understand how to approach this kind of problem properly.
So let's say I have this regular expression:
[A − Z]
∗01∗
[ˆ[A − Z]]{3}
over the alphabet [A-Z][0-9].
First question: does {3} mean that there must be at least 3 characters that belong to any "part" of the regular expression (say, 3 of [A − Z]), or does it refer strictly to the last part ([ˆ[A − Z]])?
My second doubt: if it refers to the last part, checking that there are at least 3 occurrences should be easy (just 3 states that check whether the character is a digit, and exit otherwise), right? Otherwise, if it could apply to any part of the regular expression, how do I check, without a counter, how many occurrences repeat in any possible state? (Please also confirm whether I should be using a counter at all.)
I am not really interested in a solution with code; I just want to fully understand the topic.
Regular expressions are a formal mathematical construction, but the syntaxes used to describe them vary. In common syntaxes, {3} means the previous item is repeated three times. For example, [AB]{3} is the same as [AB][AB][AB], so it will match AAA, AAB, ABA, ABB, BAA, BAB, BBA, or BBB. Or (AA|B){2} will match AAAA, AAB, BAA, or BB. It does not require there be two characters. It requires there be two matches of (AA|B).
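To see the "repeat the previous item" rule concretely, here is a small sketch that enumerates every short string over the alphabet {A, B} that each pattern matches in full. It assumes Python's `re` syntax, which is only one of the common syntaxes; the helper name `full_matches` is made up for illustration.

```python
import re
from itertools import product

def full_matches(pattern, alphabet, max_len):
    """Return every string over `alphabet`, up to `max_len` characters,
    that the pattern matches from start to end."""
    compiled = re.compile(pattern)
    found = []
    for length in range(max_len + 1):
        for chars in product(alphabet, repeat=length):
            candidate = "".join(chars)
            if compiled.fullmatch(candidate):
                found.append(candidate)
    return found

# {3} applied to a character class: three characters, each A or B.
three_of_class = full_matches(r"[AB]{3}", "AB", 4)

# {2} applied to a group: two matches of (AA|B), not two characters.
two_of_group = full_matches(r"(AA|B){2}", "AB", 4)

print(three_of_class)
print(two_of_group)
```

The second list contains strings of length 2, 3, and 4, which is the point: the quantifier counts matches of the group, not characters.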
What the "previous item" is may depend on the particular syntax you are using. For example, in AA|B{2}, either | or {…} could be given a higher precedence, so it could be AA|(B{2}) or (AA|B){2}, depending on the rules in your syntax. However, in the specific example you asked about, the brackets clearly form a unit, so [ˆ[A − Z]]{3} requires three matches to [ˆ[A − Z]]. Again assuming a common syntax, [ˆ[A − Z]] means one character that does not match [A-Z], so a character that is not A through Z. Since your alphabet consists only of A through Z and 0 through 9, [^[A-Z]] matches 0 through 9.
Thus [^[A-Z]]{3} matches a three-digit numeral and nothing else.
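Both points can be checked mechanically. The sketch below assumes Python's `re` module, where the quantifier binds tighter than `|`. It also writes the negated class as `[^A-Z]` without the inner brackets, because Python's `re` does not treat a nested bracket as a nested class; that spelling is my substitution, not the notation from the question.

```python
import re
import string
from itertools import product

# Precedence: in Python's `re`, AA|B{2} is read as AA|(B{2}).
loose_matches_aa = re.fullmatch(r"AA|B{2}", "AA") is not None
loose_matches_aab = re.fullmatch(r"AA|B{2}", "AAB") is not None
grouped_matches_aab = re.fullmatch(r"(AA|B){2}", "AAB") is not None
print(loose_matches_aa, loose_matches_aab, grouped_matches_aab)

# The alphabet from the question: A through Z and 0 through 9.
ALPHABET = string.ascii_uppercase + string.digits

# Try every three-character string over that alphabet.
three_non_letters = re.compile(r"[^A-Z]{3}")
matched = [
    "".join(chars)
    for chars in product(ALPHABET, repeat=3)
    if three_non_letters.fullmatch("".join(chars))
]

print(len(matched))
print(all(s.isdigit() for s in matched))
```

Over this alphabet the matched strings are exactly the three-digit numerals from 000 to 999.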
First, there are a number of problems with your regex.
I believe your "smart" editor has mangled the regex. It has replaced ^ (U+005E CIRCUMFLEX ACCENT) and - (U+002D HYPHEN-MINUS) with the fancy versions: ˆ (U+02C6 MODIFIER LETTER CIRCUMFLEX ACCENT) and − (U+2212 MINUS SIGN). They look the same, but they are different characters and have different meanings in a regex. To avoid this, be sure to use a good code editor such as Atom.
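If you are unsure which characters you actually have, you can ask Python. This is a minimal sketch using the standard `unicodedata` and `re` modules; the variable names are made up for illustration.

```python
import re
import unicodedata

# Print the code point and official name of each look-alike character.
for ch in ["^", "\u02c6", "-", "\u2212"]:
    print(f"U+{ord(ch):04X}  {unicodedata.name(ch)}")

# With a real hyphen-minus, [A-Z] is a range from A to Z.
ascii_range = re.compile("[A-Z]")
# With U+2212 MINUS SIGN, the class is just three literal characters.
mangled_range = re.compile("[A\u2212Z]")

ascii_matches_b = ascii_range.fullmatch("B") is not None
mangled_matches_b = mangled_range.fullmatch("B") is not None
mangled_matches_minus = mangled_range.fullmatch("\u2212") is not None

# With a real circumflex the class is negated; with U+02C6 it is not.
negated_matches_b = re.fullmatch("[^A-Z]", "B") is not None
fancy_matches_b = re.fullmatch("[\u02c6A-Z]", "B") is not None

print(ascii_matches_b, mangled_matches_b, mangled_matches_minus)
print(negated_matches_b, fancy_matches_b)
```

The mangled class no longer matches B at all, and the fancy circumflex silently turns a negated class into an ordinary one.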
Spaces are also important. [A - Z] means something different than [A-Z]. So are newlines; they are treated literally.
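A quick way to convince yourself, assuming Python's `re` without the verbose flag (other engines may treat whitespace differently):

```python
import re

compact = re.compile(r"[A-Z]")    # the range A through Z
spaced = re.compile(r"[A - Z]")   # A, Z, and a space (a "range" from space to space)

compact_matches_m = compact.fullmatch("M") is not None
spaced_matches_m = spaced.fullmatch("M") is not None
spaced_matches_space = spaced.fullmatch(" ") is not None

# A newline inside the pattern is a literal newline that must appear in the input.
with_newline = re.compile("A\nB")
newline_matches_ab = with_newline.fullmatch("AB") is not None
newline_matches_a_nl_b = with_newline.fullmatch("A\nB") is not None

print(compact_matches_m, spaced_matches_m, spaced_matches_space)
print(newline_matches_ab, newline_matches_a_nl_b)
```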
∗01∗ does not mean to match 01 surrounded by anything. Regexes don't work like file globs. While * does mean "zero or more" like a file glob, it is "zero or more of the immediately preceding thing". The dot (.) matches (almost) anything. So it would be .*01.*.
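Here is a short sketch of the difference, again assuming Python's `re`. Note that in this engine a pattern that starts with `*` is rejected outright, because there is nothing before the star to repeat.

```python
import re

# `01*` is a 0 followed by zero or more 1s; the star applies only to the 1.
zero_then_ones = re.compile(r"01*")
star_results = {
    text: zero_then_ones.fullmatch(text) is not None
    for text in ["0", "01", "0111", "001"]
}

# `.*01.*` is "anything, then 01, then anything".
surrounded = re.compile(r".*01.*")
surrounded_hit = surrounded.fullmatch("xx01yy") is not None
surrounded_miss = surrounded.fullmatch("xx0yy") is not None

# A glob-style leading star is not a valid pattern here.
try:
    re.compile(r"*01*")
    leading_star_ok = True
except re.error:
    leading_star_ok = False

print(star_results)
print(surrounded_hit, surrounded_miss, leading_star_ok)
```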
[ˆ[A − Z]]{3} should be [^A-Z]{3}. [^...] means to match what is not in the set. [^A-Z]{3} means to match exactly 3 characters that are not between A and Z: 123 or abc or !@#.
Putting it all together: [A-Z].*01.*[^A-Z]{3} says to match exactly one character in the set between A and Z, then anything, then exactly 01, then anything, then exactly 3 characters which are not in the set between A and Z. C01;;; and blah blah Z blah 01 blah blah abc both match.
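You can try the combined pattern yourself. This sketch assumes Python's `re` and uses `search`, so the match may start anywhere in the string; the two extra samples that should not match are my own additions.

```python
import re

combined = re.compile(r"[A-Z].*01.*[^A-Z]{3}")

samples = [
    "C01;;;",
    "blah blah Z blah 01 blah blah abc",
    "c01;;;",   # no uppercase letter before the 01
    "C01AB",    # fewer than three non-uppercase characters after the 01
]
results = {text: combined.search(text) is not None for text in samples}

# The last piece on its own: exactly three characters, none of them A-Z.
tail = re.compile(r"[^A-Z]{3}")
tail_results = {
    text: tail.fullmatch(text) is not None
    for text in ["123", "abc", "!@#", "12A"]
}

print(results)
print(tail_results)
```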
Regex 101 is a valuable resource for understanding regexes. Regular-Expressions.info is a very good tutorial site.
First question is: {3} means that there must be atleast 3 characters that belong to a "part" of the regular expression(lets say 3[A − Z]) or it is strictly refering to the last one ([ˆ[A − Z]])?
{3} is a "quantifier". So are + (one or more), * (zero or more), and ? (zero or one). Every quantifier applies to the thing immediately preceding it. A{3} means "AAA". [A-Z]{3} means exactly three characters in the set of A through Z.
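A few quick checks of each quantifier, as a sketch in Python's `re`; the helper `fully_matches` and the sample strings are made up for illustration.

```python
import re

def fully_matches(pattern, text):
    """True if the pattern matches the whole of `text`."""
    return re.fullmatch(pattern, text) is not None

examples = [
    (r"A{3}", "AAA"),       # exactly three As
    (r"A{3}", "AA"),        # only two As
    (r"[A-Z]{3}", "XYZ"),   # three characters, each in A-Z
    (r"[A-Z]{3}", "XY1"),   # third character is not in A-Z
    (r"AB+", "ABBB"),       # + : one or more Bs
    (r"AB+", "A"),          # + needs at least one B
    (r"AB*", "A"),          # * : zero Bs is fine
    (r"AB?", "ABB"),        # ? : at most one B
]
outcomes = [fully_matches(pattern, text) for pattern, text in examples]

for (pattern, text), ok in zip(examples, outcomes):
    print(f"{pattern!r:12} {text!r:8} {ok}")
```

In every case the quantifier applies only to the single item right before it (the letter B or the class [A-Z]), never to the whole pattern.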
My second doubt is: if it is the last one, checking if there are atleast 3 occurrences might be easy(just 3 states that check if the char is a number, otherwise exit),right? Otherwise, if it might be any of the possible part of the regular expression, how do i check without a counter(eventually confirm if i shouldnt be using a counter) how many occurrences repeat in any possible state?
Regular expressions are insanely complicated. They are a language unto themselves. Unless this is for a class, use a regular expression library such as PCRE.
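If it is for a class, here is a rough sketch of the "states instead of a counter" idea from the question, covering only the final [0-9]{3} part. A fixed repetition like {3} can be unrolled into one state per digit consumed, so no counter variable is needed. The state names and the function `accepts_three_digits` are made up for illustration, and this is not how a library such as PCRE is implemented.

```python
DIGITS = set("0123456789")

# One transition per digit consumed; q3 is the only accepting state.
# There is no transition out of q3, so a fourth character is rejected.
TRANSITIONS = {"q0": "q1", "q1": "q2", "q2": "q3"}
ACCEPTING = "q3"

def accepts_three_digits(text):
    """Accept exactly the strings that [0-9]{3} matches in full."""
    state = "q0"
    for ch in text:
        if ch in DIGITS and state in TRANSITIONS:
            state = TRANSITIONS[state]
        else:
            return False  # dead state: wrong character or too many characters
    return state == ACCEPTING

for sample in ["123", "12", "1234", "12A", ""]:
    print(repr(sample), accepts_three_digits(sample))
```

The number of states grows with the repetition count, which is why this works for a fixed {3} but gives you no general-purpose counter.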