Condition inside a regular expression

Question

I would like to remove any extra whitespace from my code; I'm parsing a docblock. The problem is that I do not want to remove whitespace within a <code>code goes here</code> block.

For example, I use this to remove extra whitespace:

$string = preg_replace('/[ ]{2,}/', '', $string);

But I would like to keep the whitespace within <code></code>.

This code/string:

This  is some  text
  This is also   some text

<code>
User::setup(array(
    'key1' => 'value1',
    'key2' => 'value1'
));
</code>

Should be transformed into:

This is some text
This is also some text

<code>
User::setup(array(
    'key1' => 'value1',
    'key2' => 'value1'
));
</code>

How can I do this?

Answer 1

You aren't really looking for a condition; you need a way to skip parts of the string so they are not replaced. This can be done rather easily with preg_replace, by inserting dummy capturing groups and replacing each group with itself. In your case you only need one:

$str = preg_replace("~(<code>.*?</code>)|^ +| +$|( ) +~smi" , "$1$2", $str);

How does it work?

(<code>.*?</code>) - match a <code> block into the first group, $1. This assumes simple formatting and no nesting, but it can be made more sophisticated if needed.
^ + - match and remove spaces at the beginnings of lines.
 +$ - match and remove spaces at the ends of lines.
( ) + - match two or more spaces in the middle of a line, and capture the first one into the second group, $2.

The replacement string, $1$2, keeps <code> blocks and the first space if captured, and removes anything else the pattern matches.

Things to remember:

If $1 or $2 didn't capture anything, it is replaced with an empty string.
Alternations (a|b|c) are tried from left to right: as soon as one alternative matches, the engine is satisfied and does not try the remaining ones at that position. That is why ^ +| +$ must come before ( ) +.

Working example: http://ideone.com/HxbaV
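As a quick check, here is a minimal sketch that wraps the pattern above in a function and runs it on a shortened version of the sample from the question. The function name collapse_spaces_outside_code is made up for this example, and single quotes are used so that PHP never tries to interpolate $1 and $2; the pattern itself is unchanged.

```php
<?php
// Hypothetical helper name; the pattern is the one given in this answer.
function collapse_spaces_outside_code(string $str): string
{
    $result = preg_replace(
        '~(<code>.*?</code>)|^ +| +$|( ) +~smi', // $1 = code block, $2 = first space of a run
        '$1$2',                                  // groups that did not capture expand to ''
        $str
    );
    // preg_replace returns null on a PCRE error; fall back to the input in that case.
    return $result ?? $str;
}

$input = "This  is some  text\n"
       . "  This is also   some text\n"
       . "\n"
       . "<code>\n"
       . "User::setup(array(\n"
       . "    'key1' => 'value1'\n"
       . "));\n"
       . "</code>";

echo collapse_spaces_outside_code($input), "\n";
```

The two text lines come out with single spaces and no leading indentation, while everything between <code> and </code> is copied through untouched because the first alternative consumes the whole block before the space-matching alternatives get a chance to see it.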

Answer 2

When parsing markup with PHP and regex, the preg_replace_callback() function combined with the (?R), (?1), (?2)... recursive expressions makes for a very powerful tool indeed. The following script handles your test data quite nicely:

<?php // test.php 20110312_2200

function clean_non_code(&$text) {
    $re = '%
    # Match and capture either CODE into $1 or non-CODE into $2.
      (                      # $1: CODE section (never empty).
        <code[^>]*>          # CODE opening tag
        (?R)+                # CODE contents w/nested CODE tags.
        </code\s*>           # CODE closing tag
      )                      # End $1: CODE section.
    |                        # Or...
      (                      # $2: Non-CODE section (may be empty).
        [^<]*+               # Zero or more non-< {normal*}
        (?:                  # Begin {(special normal*)*}
          (?!</?code\b)      # If not a code open or close tag,
          <                  # match non-code < {special}
          [^<]*+             # More {normal*}
        )*+                  # End {(special normal*)*}
      )                      # End $2: Non-CODE section
    %ix';

    $text = preg_replace_callback($re, '_my_callback', $text);
    if ($text === null) exit("PREG Error!\nTarget string too big.");
    return $text;
}

// The callback function is called once for each
// match found and is passed one parameter: $matches.
function _my_callback($matches)
{ // Either $1 or $2 matched, but never both.
    if ($matches[1]) {
        return $matches[1];
    }
    // Collapse multiple tabs and spaces into a single space.
    $matches[2] = preg_replace('/[ \t][ \t]++/S', ' ', $matches[2]);
    // Trim each line
    $matches[2] = preg_replace('/^ /m', '', $matches[2]);
    $matches[2] = preg_replace('/ $/m', '', $matches[2]);
    return $matches[2];
}

// Create some test data.
$data = "This  is some  text
  This is also   some text

<code>
User::setup(array(
    'key1'      => 'value1',
    'key2'      => 'value1',
    'key42'     => '<code>
        Pay no attention to this. It has been proven over and
        over again that it is <code>   unpossible   </code>
        to parse nested stuff with regex!           </code>'
));
</code>";

// Demonstrate that it works on one small test string.
echo("BEFORE:\n". $data ."\n\n");
echo("AFTER:\n". clean_non_code($data) ."\n\nTesting...");

// Build a large test string.
$bigdata = '';
for ($i =   0; $i < 30000; ++$i) $bigdata .= $data;
$size = strlen($bigdata);

// Measure how long it takes to process it.
$time = microtime(true);
$bigdata = clean_non_code($bigdata);
$time = microtime(true) - $time;

// Print benchmark results
printf("Done.\nTest string size: %d bytes. Time: %.3f sec. Speed: %.0f KB/s.\n",
    $size, $time, ($size / $time)/1024.);
?>

Here are the script benchmark results when run on my test box (WinXP32, PHP 5.2.14 (cli)):

'Test string size: 10410000 bytes. Time: 1.219 sec. Speed: 8337 KB/s.'

Note that this solution does not handle CODE tags that have <> angle brackets in their attributes (probably a very rare edge case), but the regex could easily be modified to handle these as well. Note also that the maximum string length depends on the composition of the string content (i.e. big CODE blocks reduce the maximum input string length).
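The script above treats every null result as "string too big". As a small sketch of how the cause could be narrowed down, the standard preg_last_error() function reports which PCRE limit (if any) was hit. The helper name describe_preg_error and the sample pattern below are illustrative only and are not part of the answer's script.

```php
<?php
// Hypothetical helper: maps the last PCRE error code to a readable message.
function describe_preg_error(): string
{
    switch (preg_last_error()) {
        case PREG_NO_ERROR:
            return 'no error';
        case PREG_INTERNAL_ERROR:
            return 'internal PCRE error';
        case PREG_BACKTRACK_LIMIT_ERROR:
            return 'backtrack limit exhausted (see pcre.backtrack_limit)';
        case PREG_RECURSION_LIMIT_ERROR:
            return 'recursion limit exhausted (see pcre.recursion_limit)';
        case PREG_BAD_UTF8_ERROR:
            return 'malformed UTF-8 in subject';
        default:
            return 'other PCRE error';
    }
}

$text   = "a  b";
$result = preg_replace('/ {2,}/', ' ', $text);

if ($result === null) {
    // Only reached when PCRE itself failed, not when nothing matched.
    throw new RuntimeException('PREG error: ' . describe_preg_error());
}

echo $result, "\n";
```

Checking the error code right after the call matters, since the next preg_* call overwrites it.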


Answer 3

What you will want is to parse it using some form of HTML parser.

For example, you could use DOMDocument to iterate through all elements, ignoring code elements, and strip whitespace from their text nodes.

Alternatively, read the file with file() so you have an array of lines (fopen() on its own only gives you a file handle), and step through each line, stripping whitespace if it is outside of a code element.

To determine whether you are in a code element, look for the starting tag <code> and set a flag which says "in code element" mode. You can then skip these lines. Reset the flag when you encounter </code>. You could take nesting into account by storing the state as an integer, even though nested code elements are not the smartest idea (why would you nest them?).

Mario came up with this before me.
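The line-by-line idea can be sketched roughly as follows. This is only an illustration under simplifying assumptions: each <code> and </code> tag sits on its own line, the tags carry no attributes, and the function name strip_spaces_outside_code is invented here. An integer depth counter stands in for the flag so that nested blocks are tolerated.

```php
<?php
// Hypothetical helper: cleans lines that are outside <code> ... </code> blocks.
function strip_spaces_outside_code(array $lines): array
{
    $depth = 0;   // how many <code> elements are currently open
    $out   = [];

    foreach ($lines as $line) {
        $lower  = strtolower($line);
        $opens  = substr_count($lower, '<code>');
        $closes = substr_count($lower, '</code>');

        // Only touch lines that are fully outside any code element.
        if ($depth === 0 && $opens === 0) {
            $line = preg_replace('/ {2,}/', ' ', $line); // collapse runs of spaces
            $line = trim($line, ' ');                    // drop leading/trailing spaces
        }

        // Update the nesting depth after handling the line; never go below zero.
        $depth = max(0, $depth + $opens - $closes);
        $out[] = $line;
    }

    return $out;
}

$docblock = "This  is some  text\n  This is also   some text\n\n<code>\n    'key1' => 'value1',\n</code>";
echo implode("\n", strip_spaces_outside_code(explode("\n", $docblock))), "\n";
```

With file() in place of explode(), the same loop works on a file; in that case the FILE_IGNORE_NEW_LINES flag keeps the line endings out of the array.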

Answer 4

Parsing HTML with regexes is a bad idea.

RegEx match open tags except XHTML self-contained tags

Use something like Zend_DOM to parse the HTML and extract the parts of it in which you need to replace spaces.

Answer 1
4, accepted, 2011-03-13 07:55:15

Answer 2
2, 2011-03-13 06:50:33

Answer 3
1, 2011-03-12 16:35:10

Answer 4
0, 2011-03-12 15:18:31
