
What are possessive quantifiers in Java regular expressions used for?

I'm reading about regular expressions in Java. I understand that possessive quantifiers do not backtrack: they never release characters to give another group a chance to match. But I can't think of any situation where possessive quantifiers are used in practice. Some resources I have read say that, because possessive quantifiers don't backtrack, the engine doesn't need to remember the position of each character in the input string, which significantly improves the performance of the regular expression engine. I tested this by writing an example:

I have a string containing several thousand digits.

First, I defined a greedy pattern: String regex = "(\\d+)";

Then I measured the time it took to match.

Second, I changed it to a possessive pattern: String regex = "(\\d++)";

I measured the time again, but I don't see any difference.

Am I misunderstanding something?

Besides, can anyone give me some specific cases where possessive quantifiers are used?

And about the term: in the book "Java Regular Expressions: Taming the java.util.regex Engine" by Mehran Habibi, the author uses the term "possessive qualifiers", while people on the Internet use "possessive quantifiers". Which one is correct, or are both?

Possessive quantifiers are quantifiers that are greedy (they try to match as many characters as possible) and don't backtrack (so the match can fail if a possessive quantifier goes too far).

Example

Normal (greedy) quantifiers

Say you have the following regex:

^([A-Za-z0-9]+)([A-Z0-9][A-Z0-9])(.*)

The regex aims to match one or more alphanumeric characters (upper- or lowercase), [A-Za-z0-9]+, followed by two characters that are uppercase letters or digits, [A-Z0-9][A-Z0-9], after which any characters may follow.

Any string that satisfies this constraint will match, including AAA. One could argue that the second and third A should also belong to the greedy first group, but then nothing would be left for the second group and the string would not match. The regex engine is therefore smart enough (through backtracking, not dynamic programming) to know when the first group has to let go of characters.
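To see the backtracking described above in action, here is a minimal sketch that runs the greedy regex against AAA and prints the resulting groups. The class name GreedyBacktrackDemo and the helper split are made up for this illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GreedyBacktrackDemo {
    // The greedy regex from the answer above.
    static final Pattern GREEDY =
            Pattern.compile("^([A-Za-z0-9]+)([A-Z0-9][A-Z0-9])(.*)");

    // Hypothetical helper: formats the three groups as "(g1)(g2)(g3)",
    // or returns "no match" when the regex does not match.
    static String split(String input) {
        Matcher m = GREEDY.matcher(input);
        if (!m.find()) {
            return "no match";
        }
        return "(" + m.group(1) + ")(" + m.group(2) + ")(" + m.group(3) + ")";
    }

    public static void main(String[] args) {
        // Group 1 first grabs all three A's, then gives two back
        // so that group 2 can match.
        System.out.println(split("AAA")); // (A)(AA)()
        // Too short: group 1 needs one character and group 2 needs two.
        System.out.println(split("AA"));  // no match
    }
}
```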

Non-greedy quantifiers

A problem that can occur is that the first group is "too greedy" for data extraction purposes. Say you have the string AAAAAAA. Several subdivisions are possible: (A)(AA)(AAAA), (AA)(AA)(AAA), etc. By default, each group in a regex is as greedy as possible (as long as that does not prevent the string from matching). The regex will thus subdivide the string into (AAAAA)(AA)(). Suppose instead that you want the regex to move on to the next group as early as possible: as soon as at least one character has been consumed and two characters in the [A-Z0-9] range follow.

In order to achieve this, you can write:

^([A-Za-z0-9]+?)([A-Z0-9][A-Z0-9])(.*)

The string AAAAAAA will then be split as (A)(AA)(AAAA).
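The difference between the greedy and the non-greedy version can be checked with a short sketch like the one below. The class name LazyVsGreedyDemo and the helper split are hypothetical names chosen for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LazyVsGreedyDemo {
    static final String GREEDY = "^([A-Za-z0-9]+)([A-Z0-9][A-Z0-9])(.*)";
    static final String LAZY   = "^([A-Za-z0-9]+?)([A-Z0-9][A-Z0-9])(.*)";

    // Hypothetical helper: formats the three groups as "(g1)(g2)(g3)".
    static String split(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        if (!m.find()) {
            return "no match";
        }
        return "(" + m.group(1) + ")(" + m.group(2) + ")(" + m.group(3) + ")";
    }

    public static void main(String[] args) {
        String input = "AAAAAAA";
        // Greedy: group 1 keeps as much as it can while still allowing a match.
        System.out.println(split(GREEDY, input)); // (AAAAA)(AA)()
        // Non-greedy: group 1 stops as soon as the rest of the regex can match.
        System.out.println(split(LAZY, input));   // (A)(AA)(AAAA)
    }
}
```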

Possessive quantifiers

Possessive quantifiers are greedy quantifiers, but once they have consumed a character, they will never give it back to another group. For instance:

^([A-Z]++)([H-Zw])(.*)

If you wrote ^([A-Z]+)([H-Z])(.*), the string AH0 would match. The first group is greedy (it first takes both A and H), but since eating (that's the word sometimes used) H would cause the string not to match, it willingly gives H back. With the possessive quantifier, the group is not willing to give up H. As a result it eats both A and H and keeps them. Only 0 is left for the second group, but the second group cannot eat that character. So the regex fails where the non-possessive quantifier would have produced a successful match. The string Aw, however, will match successfully, since the first group is not interested in w.
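The behaviour described above can be reproduced with a small sketch. For a like-for-like comparison, the greedy pattern here uses the same second group [H-Zw] as the possessive one, which is a choice made for this example. The class name PossessiveDemo and the helper matches are hypothetical.

```java
import java.util.regex.Pattern;

class PossessiveDemo {
    static final String GREEDY     = "^([A-Z]+)([H-Zw])(.*)";
    static final String POSSESSIVE = "^([A-Z]++)([H-Zw])(.*)";

    // Hypothetical helper: true if the regex matches at the start of the input.
    static boolean matches(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        // Greedy: group 1 eats "AH", then gives "H" back to group 2.
        System.out.println(matches(GREEDY, "AH0"));     // true
        // Possessive: group 1 eats "AH" and keeps it; group 2 cannot match "0".
        System.out.println(matches(POSSESSIVE, "AH0")); // false
        // "w" is not in [A-Z], so group 1 stops after "A" and group 2 takes "w".
        System.out.println(matches(POSSESSIVE, "Aw"));  // true
    }
}
```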

By default, quantifiers are greedy: they try to match as much as possible. A possessive quantifier prevents backtracking, meaning that what it has matched will not be backtracked into, even if that causes the whole match to fail. As stated in the Regex Tutorial (Possessive Quantifiers):

Possessive quantifiers are a way to prevent the regex engine from trying all permutations. This is primarily useful for performance reasons. You can also use possessive quantifiers to eliminate certain matches.
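This also suggests why timing (\\d+) against (\\d++) on a string of digits shows no difference: that match succeeds on the first attempt, so the greedy quantifier never has to backtrack. The saving only appears when the rest of the pattern fails. The sketch below illustrates this without relying on timings. It wraps the input in a hypothetical CountingSequence that counts calls to charAt, assuming the engine reads its input through CharSequence.charAt (an implementation detail of java.util.regex, not a documented guarantee). The patterns \\d+; and \\d++; and all class and method names are made up for this example.

```java
import java.util.regex.Pattern;

class BacktrackCostDemo {
    // Hypothetical wrapper: counts how often the regex engine reads a character.
    static final class CountingSequence implements CharSequence {
        private final String text;
        long reads = 0;

        CountingSequence(String text) { this.text = text; }

        @Override public int length() { return text.length(); }
        @Override public char charAt(int index) { reads++; return text.charAt(index); }
        @Override public CharSequence subSequence(int start, int end) {
            return text.substring(start, end);
        }
        @Override public String toString() { return text; }
    }

    // Builds a string of n digits.
    static String digits(int n) {
        StringBuilder sb = new StringBuilder(n);
        for (int i = 0; i < n; i++) {
            sb.append('7');
        }
        return sb.toString();
    }

    // Number of character reads performed while trying to match the whole input.
    static long reads(String regex, String input) {
        CountingSequence seq = new CountingSequence(input);
        Pattern.compile(regex).matcher(seq).matches();
        return seq.reads;
    }

    public static void main(String[] args) {
        String input = digits(5000); // digits only, no trailing ';', so both patterns fail

        // Greedy: after eating all digits, it backs off one digit at a time
        // and retries ';' at every position before giving up.
        System.out.println("greedy reads:     " + reads("\\d+;", input));
        // Possessive: after eating all digits, it fails immediately.
        System.out.println("possessive reads: " + reads("\\d++;", input));
    }
}
```

The exact counts depend on the JDK version, so none are shown here; the point is only that the greedy pattern does extra work on a failing input. With nested quantifiers that extra work can grow much faster than linearly, which is where possessive quantifiers tend to pay off.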
