Java Regex problem

I'm making an XMLParser for a Java program (I know there are good XMLParsers out there, but I just want to do it).

I have a method called getAttributeValue(String xmlElement, String attribute) and am using regex to find a sequence of characters consisting of the attribute name plus

="any characters that aren't a double quote"

I can then parse the contents of the quotes. Unfortunately, I'm having trouble with the regex pattern. If I use:

Pattern p = Pattern.compile(attribute + "=\"(.)+\"");

Then I get a string starting with my attribute name, but because there are loads of attributes and values, and the last one's value also ends with a double quote, I get the string I want plus all the other attribute names and values, like so:

attributeOne="contents" attributeTwo="contents2" attributeThree="contents3"

So I thought that I could have a regex pattern that, instead of the "." any-character symbol, would have "any character but not a double quote". I have tried:

Pattern p = Pattern.compile(attribute + "=\"(.&&[^\"])+\"");
Pattern p = Pattern.compile(attribute + "=\"(.&&(^\"))+\"");
Pattern p = Pattern.compile(attribute + "=\"([.&&[^\"]]+)\"");

but none of them work. I'd be grateful for any suggestions and comments.

Thanks.

The regular expression pattern for:

="any characters that aren't a double quote"

is ="[^"]*" , which as a Java string literal is "=\"[^\"]*\"" .

The [...] construct is called a character class; e.g. [aeiou] matches any one of the lowercase vowels. The [^...] construct is a negated character class; e.g. [^aeiou] matches any one character except the lowercase vowels (which includes consonants, symbols, digits, etc).

Note that this pattern does not allow an escaped " in the String (handling that possibility requires a different pattern).
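As a minimal sketch of how the negated character class could be wired into the asker's method: the class name AttributeValueSketch, the capturing group, the null return for a missing attribute, and the use of Pattern.quote are assumptions added for illustration, not part of the original answer.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AttributeValueSketch {
    // Returns the text between the quotes for the named attribute,
    // or null if no such attribute is found (an assumed convention).
    static String getAttributeValue(String xmlElement, String attribute) {
        // Pattern.quote treats the attribute name as a literal, so regex
        // metacharacters in the name cannot change the pattern.
        Pattern p = Pattern.compile(Pattern.quote(attribute) + "=\"([^\"]*)\"");
        Matcher m = p.matcher(xmlElement);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String element = "<item attributeOne=\"contents\" attributeTwo=\"contents2\" attributeThree=\"contents3\"/>";
        System.out.println(getAttributeValue(element, "attributeOne")); // contents
        System.out.println(getAttributeValue(element, "attributeTwo")); // contents2
    }
}
```

Because [^"] can never consume a double quote, the match stops at the first closing quote no matter how many attributes follow. One caveat of this sketch: the attribute name is not anchored, so a name that is a suffix of another attribute's name could match in the wrong place.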

On greedy, reluctant, and negated character class matching

To understand why ".+" doesn't "work" as expected, and why you sometimes see the reluctant version ".+?" used to try to "fix" this problem, consider the following example:

Example 1: From A to Z

Let's compare these two patterns: A.*Z and A.*?Z .

Given the following input:

eeeAiiZuuuuAoooZeeee

The patterns yield the following matches:

A.*Z yields 1 match: AiiZuuuuAoooZ
A.*?Z yields 2 matches: AiiZ and AoooZ

Let's first focus on what A.*Z does. When it has matched the first A , the .* , being greedy, first tries to match as many . as possible.

eeeAiiZuuuuAoooZeeee
   \_______________/
    A.* matched, Z can't match

Since the Z doesn't match, the engine backtracks, and .* must then match one fewer . :

eeeAiiZuuuuAoooZeeee
   \______________/
    A.* matched, Z still can't match

This happens a few more times, until finally we come to this:

eeeAiiZuuuuAoooZeeee
   \__________/
    A.* matched, Z can now match

Now Z can match, so the overall pattern matches:

eeeAiiZuuuuAoooZeeee
   \___________/
    A.*Z matched

By contrast, the reluctant repetition in A.*?Z first matches as few . as possible, and then takes more . as necessary. This explains why it finds two matches in the input.

Here's a visual representation of what the two patterns matched:

eeeAiiZuuuuAoooZeeee
   \__/r   \___/r      r = reluctant
    \____g____/        g = greedy
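The walk-through above can be checked with a short Java sketch. The class GreedyVsReluctant and the helper allMatches are hypothetical names introduced here; only the two patterns and the input come from the answer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GreedyVsReluctant {
    // Collects every non-overlapping match of regex in input, left to right.
    static List<String> allMatches(String regex, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        String input = "eeeAiiZuuuuAoooZeeee";
        // Greedy: .* grabs as much as it can, then backtracks to the last Z.
        System.out.println(allMatches("A.*Z", input));  // [AiiZuuuuAoooZ]
        // Reluctant: .*? grows one character at a time until a Z follows.
        System.out.println(allMatches("A.*?Z", input)); // [AiiZ, AoooZ]
    }
}
```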

Example: An alternative

In many applications, the two matches in the above input are what is desired, so a reluctant .*? is used instead of the greedy .* to prevent overmatching. For this particular pattern, however, there is a better alternative: a negated character class.

The pattern A[^Z]*Z finds the same two matches as the A.*?Z pattern for the above input (as seen on ideone.com). [^Z] is what is called a negated character class: it matches anything but Z .

The main difference between the two patterns is in performance: being more strict, the negated character class can only match one way for a given input. It doesn't matter whether you use a greedy or a reluctant modifier for this pattern. In fact, in some flavors, you can do even better and use what is called a possessive quantifier, which doesn't backtrack at all.
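Java is one of the flavors that supports possessive quantifiers (written with a trailing + , as in *+ ). The sketch below, using a hypothetical class NegatedClassVariants and helper findAll, suggests that the greedy, reluctant, and possessive forms of the negated-class pattern all find the same matches on the example input. It shows only the results, not the performance difference.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NegatedClassVariants {
    // Collects every non-overlapping match of regex in input.
    static List<String> findAll(String regex, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        String input = "eeeAiiZuuuuAoooZeeee";
        System.out.println(findAll("A[^Z]*Z", input));  // greedy:     [AiiZ, AoooZ]
        System.out.println(findAll("A[^Z]*?Z", input)); // reluctant:  [AiiZ, AoooZ]
        System.out.println(findAll("A[^Z]*+Z", input)); // possessive: [AiiZ, AoooZ]
    }
}
```

The possessive form is safe here because [^Z]* can never consume the Z that must follow it, so there is nothing useful to backtrack into.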

Example 2: From A to ZZ

This example should be illustrative: it shows how the greedy, reluctant, and negated character class patterns match differently given the same input.

eeAiiZooAuuZZeeeZZfff

These are the matches for the above input:

A[^Z]*ZZ yields 1 match: AuuZZ
A.*?ZZ yields 1 match: AiiZooAuuZZ
A.*ZZ yields 1 match: AiiZooAuuZZeeeZZ

Here's a visual representation of what they matched:

         ___n
        /   \              n = negated character class
eeAiiZooAuuZZeeeZZfff      r = reluctant
  \_________/r   /         g = greedy
   \____________/g
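The diagram can be reproduced with a small Java sketch. The pattern strings A[^Z]*ZZ , A.*?ZZ , and A.*ZZ are inferred from the example's title and the diagram's labels, and the class FromAToZZ and helper matchesOf are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FromAToZZ {
    // Collects every non-overlapping match of regex in input.
    static List<String> matchesOf(String regex, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        String input = "eeAiiZooAuuZZeeeZZfff";
        // Negated class: the first A fails (a lone Z blocks it), the second A succeeds.
        System.out.println(matchesOf("A[^Z]*ZZ", input)); // [AuuZZ]
        // Reluctant: starts at the first A and stops at the first ZZ.
        System.out.println(matchesOf("A.*?ZZ", input));   // [AiiZooAuuZZ]
        // Greedy: starts at the first A and stops at the last ZZ.
        System.out.println(matchesOf("A.*ZZ", input));    // [AiiZooAuuZZeeeZZ]
    }
}
```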

Try this:

attribute + "=\".*?\""

The reasons for this are:
* instead of + because you can have an empty attribute: something=""
*? instead of * to make it reluctant instead of greedy.
See also: regular expressions tutorial on repetition.
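To see why * rather than + matters for an empty attribute, here is a rough sketch. The class EmptyAttributeSketch, the helper firstGroup, the capturing groups, and the sample input are assumptions added for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EmptyAttributeSketch {
    // Returns capture group 1 of the first match, or null when nothing matches.
    static String firstGroup(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String element = "something=\"\" other=\"v\"";
        // .*? may match zero characters, so the empty value is found.
        System.out.println("[" + firstGroup("something=\"(.*?)\"", element) + "]"); // []
        // .+? needs at least one character, so it swallows the closing quote
        // and runs on to the next quote in the input.
        System.out.println("[" + firstGroup("something=\"(.+?)\"", element) + "]"); // [" other=]
    }
}
```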

attribute + "=\"[^\"]*\""

should work. But what do you do if the string you're matching against might itself contain escaped quotes? Do you anticipate a need to handle this?

In that case, you could use

attribute + "=\"(?:\\\\.|[^\"])*\""
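The following sketch compares the two patterns on a value containing backslash-escaped quotes, which is the escaping convention this answer supposes. The class EscapedQuoteSketch, the helper valueWith, the added capturing groups, and the sample input are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EscapedQuoteSketch {
    // Builds attribute="<body>" with body as capture group 1 and returns
    // the captured value, or null if the attribute is not found.
    static String valueWith(String body, String xmlElement, String attribute) {
        Pattern p = Pattern.compile(Pattern.quote(attribute) + "=\"(" + body + ")\"");
        Matcher m = p.matcher(xmlElement);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // The element text is: title="say \"hi\"" lang="en"
        String element = "title=\"say \\\"hi\\\"\" lang=\"en\"";
        // Plain negated class: stops at the first quote, escaped or not.
        System.out.println(valueWith("[^\"]*", element, "title"));             // say \
        // Escape-aware: a backslash followed by any character is consumed as a pair.
        System.out.println(valueWith("(?:\\\\.|[^\"])*", element, "title"));   // say \"hi\"
    }
}
```

The alternation tries the backslash-plus-character branch first, so an escaped quote is consumed as a unit before the [^"] branch can see the backslash on its own.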
