Assistance with regexp

I feel ashamed, but some aspects of regexps are still not clear to me. I need to parse a text file that contains a number of string literals in the format @"I'm a string". I've composed a simple pattern, /@"([^"]*)"/si. It works perfectly: preg_match_all returns a collection. But obviously it doesn't work properly if a string literal contains escaped quotes, like @"I'm plain string. I'm \"qouted\" string ". I would appreciate any clue.

This is a use case for Friedl's classic "unrolled loop" (EDIT: fixed grouping for capture):

/"((?:[^"\\]|\\.)*)"/

This will match the quoted string, taking backslash-escaped quotes into account.

The full regex you would use to match a field (including the @) would be:

/@"((?:[^"\\]|\\.)*)"/

But be careful! I often see people complaining that this pattern doesn't work in PHP, and this is because of the slightly mind-melting nature of using a backslash in a string.

The backslashes in the above pattern represent a literal backslash that needs to be passed into PCRE. This means that they need to be double-escaped when using them in a PHP string:

$expr = '/@"((?:[^"\\\\]|\\\\.)*)"/';

preg_match_all($expr, $subject, $matches);

print_r($matches[1]); // this will show the content of all the matched fields

See it working
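A self-contained version of the snippet above, with a sample $subject filled in. The sample text is an assumption made for illustration; the nowdoc is used only so that the backslashes in the sample need no extra PHP escaping.

```php
<?php
// Hypothetical sample input: two @"..." literals, the second with escaped quotes.
$subject = <<<'TXT'
x = @"I'm a string"; y = @"I'm plain string. I'm \"qouted\" string ";
TXT;

// Each literal backslash meant for PCRE is written as \\\\ inside the PHP string.
$expr = '/@"((?:[^"\\\\]|\\\\.)*)"/';

preg_match_all($expr, $subject, $matches);

// $matches[1] holds the text between the quotes, with the escapes still in place.
print_r($matches[1]);
```

The second captured element keeps its \" sequences, which is the behaviour discussed further down.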

How does it work?

...I hear you ask. Well, let's see if I can explain this in a way that actually makes sense. Let's enable x mode so we can space it out a bit:

/
  @             # literal @
  "             # literal "
    (           # start capture group, we want everything between the quotes
      (?:       # start non-capturing group (a group we can safely repeat)
        [^"\\]  # match any character that's not a " or a \
        |       # ...or...
        \\.     # a literal \ followed by any character
      )*        # close non-capturing group and allow zero or more occurrences
    )           # close the capture group
  "             # literal "
/x
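The spaced-out pattern can be handed to PHP more or less as written. This is a sketch, assuming a nowdoc is an acceptable way to hold the pattern: a nowdoc does no escape processing, so the backslashes appear exactly as PCRE should see them. The sample $subject is made up, and the comments inside the pattern avoid the / character because it is the delimiter.

```php
<?php
// Nowdoc: no PHP-level escaping, so \\ below reaches PCRE as \\ (one literal backslash).
$expr = <<<'REGEX'
/
  @             # literal @
  "             # literal "
    (           # capture everything between the quotes
      (?:       # non-capturing group, repeated below
        [^"\\]  # any character except a quote or a backslash
        |       # ...or...
        \\.     # a backslash followed by any character
      )*        # zero or more occurrences
    )
  "             # literal "
/x
REGEX;

// Hypothetical sample input; in a single-quoted PHP string \" stays as backslash + quote.
$subject = '@"one" and @"two \"2\""';

preg_match_all($expr, $subject, $matches);
print_r($matches[1]);
```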

The really important points are these:

  • [^"\\]|\\. means that every backslash is "balanced": every backslash must escape a character, and no character will be considered more than once.
  • Wrapping the above in a * repeated group means that the above pattern can occur an unlimited number of times, and that empty strings are allowed (if you don't want to allow empty strings, change the * to a +). This is the "loop" part of the "unrolled loop".
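The same loop can also be written in the shape usually given for unrolling, normal* (special normal*)*, where "normal" is [^"\\] and "special" is \\. here. The sketch below only checks that both spellings capture the same text on a made-up sample; the variable names are assumptions, and no claim about speed is made.

```php
<?php
// Alternation form, as used in this answer.
$alternation = '/@"((?:[^"\\\\]|\\\\.)*)"/';

// Unrolled form: a run of ordinary characters, then (escape, run) repeated.
// Reaches PCRE as: /@"([^"\\]*(?:\\.[^"\\]*)*)"/
$unrolled = '/@"([^"\\\\]*(?:\\\\.[^"\\\\]*)*)"/';

// Hypothetical sample: a plain literal, one with escapes, and an empty one.
$subject = '@"plain" @"with \"escapes\" inside" @""';

preg_match_all($alternation, $subject, $a);
preg_match_all($unrolled, $subject, $b);

var_dump($a[1] === $b[1]);   // compares the captured contents of both patterns
print_r($b[1]);
```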

But the output string still contains the backslashes that escape the quotes?

Indeed it does. This is just a matching procedure; it doesn't modify the match. But because the result is the contents of the string, a simple str_replace('\\"', '"', $result) will be safe and produce the correct result.

However, when doing this sort of thing, I often find I want to handle other escape sequences as well, in which case I usually do something like this to the result:

 $result = preg_replace_callback('/\\\\./', function($match) { // '/\\\\./' reaches PCRE as /\\./ (backslash + any character)
     switch ($match[0][1]) { // inspect the escaped character
         case 'r':
             return "\r";

         case 'n':
             return "\n";

         case 't':
             return "\t";

         case '\\':
             return '\\';

         case '"':
             return '"';

         default: // if it's not a valid escape sequence, treat the \ as literal
             return $match[0];
     }
 }, $result);

This gives similar behaviour to a double-quoted string in PHP, where \t is replaced with a tab, \n is replaced with a newline and so on.
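A compact sketch of the same idea as a reusable function. The name unescape_literal and the lookup table are assumptions made for illustration; they stand in for the switch statement and handle the same five sequences.

```php
<?php
// Hypothetical helper: turns the raw captured contents into the intended text.
function unescape_literal(string $raw): string
{
    // '/\\\\./' reaches PCRE as /\\./ : a backslash followed by any character.
    return preg_replace_callback('/\\\\./', function (array $m): string {
        $map = ['r' => "\r", 'n' => "\n", 't' => "\t", '\\' => '\\', '"' => '"'];
        // Unknown sequences are returned unchanged, backslash included.
        return $map[$m[0][1]] ?? $m[0];
    }, $raw);
}

// As captured by the pattern, escapes intact (single quotes keep \t, \" and \q as-is).
$raw = 'tab:\t quote:\" backslash:\\\\ other:\q';
echo unescape_literal($raw), "\n";
```

Because each match consumes the backslash together with the character it escapes, an escaped backslash followed by n stays a backslash and an n instead of becoming a newline.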

What if I want to allow single-quoted strings as well?

This has bugged me for a very long time. I have always had a niggling feeling that this could be more efficiently handled with backreferences, but numerous attempts have failed to yield any viable results.

I do this:

/(?:"((?:[^"\\]|\\.)*)")|(?:'((?:[^'\\]|\\.)*)')/

As you can see, this is basically just applying the same pattern twice, with an OR relationship. This complicates the string extraction very slightly on the PHP side as well:

$expr = '/(?:"((?:[^"\\\\]|\\\\.)*)")|(?:\'((?:[^\'\\\\]|\\\\.)*)\')/';

preg_match_all($expr, $subject, $matches);

$result = array();
for ($i = 0; isset($matches[0][$i]); $i++) {
    if ($matches[1][$i] !== '') {
        $result[] = $matches[1][$i];
    } else {
        $result[] = $matches[2][$i];
    }
}

print_r($result);
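For what it's worth, one way to spell this with a backreference is sketched below. It is offered as an alternative to compare against, not as a claim that it is faster: the opening quote is captured in group 1, a negative lookahead keeps that same quote character out of the body, and \1 closes the literal. The sample input is made up.

```php
<?php
// Reaches PCRE as: /(["'])((?:\\.|(?!\1)[^\\])*)\1/
//   (["'])        group 1: the opening quote, either kind
//   \\.           an escaped character
//   (?!\1)[^\\]   any non-backslash character that is not the opening quote
//   \1            the same quote again, closing the literal
$expr = '/(["\'])((?:\\\\.|(?!\\1)[^\\\\])*)\\1/';

// Hypothetical sample: one double-quoted and one single-quoted literal.
$subject = <<<'TXT'
a = "say \"hi\", it's me"; b = 'it\'s "fine"';
TXT;

preg_match_all($expr, $subject, $matches);

// The content is always in group 2, whichever quote was used.
print_r($matches[2]);
```

With this shape there is no need to check which of two capture groups was filled.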

You need to use a negative lookbehind: match everything until you find a quote that is not preceded by a backslash. This is in Java:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Main {
    public static void main(String[] args) {
        final String[] strings = new String[]{"@\"I'm a string\"", "@\"I'm plain string. I'm \\\"qouted\\\" \""};

        // Lazy match up to the first quote that is not preceded by a backslash.
        final Pattern p = Pattern.compile("@\"(.*?)(?<!\\\\)\"");
        System.out.println(p.pattern());

        for (final String string : strings) {
            final Matcher matcher = p.matcher(string);
            while (matcher.find()) {
                System.out.println(matcher.group(1));
            }
        }
    }
}

Output:

@"(.*?)(?<!\\)"
I'm a string
I'm plain string. I'm \"qouted\" 

The pattern (without all the Java escapes) is: @"(.*?)(?<!\\)"
