Condition inside regex pattern

I would like to remove any extra whitespace from my code; I'm parsing a docblock. The problem is that I do not want to remove whitespace within a <code>code goes here</code> block.

For example, I use this to remove extra whitespace:

$string = preg_replace('/[ ]{2,}/', '', $string);

But I would like to keep whitespace within <code></code>.

This code/string:

This  is some  text
  This is also   some text

<code>
User::setup(array(
    'key1' => 'value1',
    'key2' => 'value1'
));
</code>

Should be transformed into:

This is some text
This is also some text

<code>
User::setup(array(
    'key1' => 'value1',
    'key2' => 'value1'
));
</code>

How can I do this?

You aren't really looking for a condition - you need a way to skip parts of the string so they are not replaced. This can be done rather easily using preg_replace, by inserting dummy capture groups and replacing each group with itself. In your case you only need one:

$str = preg_replace("~(<code>.*?</code>)|^ +| +$|( ) +~smi" , "$1$2", $str);

How does it work?

  • (<code>.*?</code>) - match a <code> block into the first group, $1. This assumes simple formatting and no nesting, but can be made more elaborate if needed.
  • ^ + - match and remove spaces at the beginnings of lines.
  • [ ]+$ - match and remove spaces at the ends of lines (written as " +$" in the pattern; the brackets only make the space visible).
  • ( ) + - match two or more spaces in the middle of a line, and capture the first one into the second group, $2.

The replacement string, $1$2, keeps <code> blocks and the first space if it was captured, and removes anything else the pattern matches.
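As a quick check of the pattern above, the following sketch runs it on the sample input from the question. The variable names are only for illustration, and the pattern and replacement are written in single quotes so PHP does not attempt any interpolation.

```php
<?php
// Sample input from the question, built line by line so the spacing is explicit.
$input = implode("\n", [
    'This  is some  text',
    '  This is also   some text',
    '',
    '<code>',
    'User::setup(array(',
    "    'key1' => 'value1',",
    "    'key2' => 'value1'",
    '));',
    '</code>',
]);

// Group 1 keeps whole <code> blocks; group 2 keeps one space from each run of spaces.
$pattern = '~(<code>.*?</code>)|^ +| +$|( ) +~smi';
$result  = preg_replace($pattern, '$1$2', $input);

echo $result, "\n";
```

The two text lines come out with single spaces and no indentation, while the four-space indentation inside the <code> block is left alone because the first alternative consumes the whole block before the other alternatives can see it.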

Things to remember:

  • If $1 or $2 didn't capture, it will be replaced with an empty string.
  • Alternations ( a|b|c ) are tried from left to right - once one branch matches, the engine is satisfied and does not try the remaining branches at that position. That is why ^ +| +$ must come before ( ) + .
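The ordering point can be seen with a reduced version of the pattern. This is only a sketch: the <code> branch is left out, so the captured space is $1 here rather than $2, and the reordered pattern exists purely for comparison.

```php
<?php
$line = '   indented   text';

// Order used in the answer: the line-start branch is tried before the capture branch,
// so the leading spaces are removed entirely.
$good = preg_replace('~^ +| +$|( ) +~m', '$1', $line);

// Reordered for illustration: the capture branch now wins at the start of the line,
// so the leading run is collapsed to one space instead of being removed.
$bad = preg_replace('~( ) +|^ +| +$~m', '$1', $line);

var_dump($good); // "indented text"
var_dump($bad);  // " indented text"
```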

Working example: http://ideone.com/HxbaV

When parsing markup with PHP and regex, the preg_replace_callback() function combined with the (?R), (?1), (?2)... recursive expressions makes for a very powerful tool indeed. The following script handles your test data quite nicely:

<?php // test.php 20110312_2200

function clean_non_code(&$text) {
    $re = '%
    # Match and capture either CODE into $1 or non-CODE into $2.
      (                      # $1: CODE section (never empty).
        <code[^>]*>          # CODE opening tag
        (?R)+                # CODE contents w/nested CODE tags.
        </code\s*>           # CODE closing tag
      )                      # End $1: CODE section.
    |                        # Or...
      (                      # $2: Non-CODE section (may be empty).
        [^<]*+               # Zero or more non-< {normal*}
        (?:                  # Begin {(special normal*)*}
          (?!</?code\b)      # If not a code open or close tag,
          <                  # match non-code < {special}
          [^<]*+             # More {normal*}
        )*+                  # End {(special normal*)*}
      )                      # End $2: Non-CODE section
    %ix';

    $text = preg_replace_callback($re, '_my_callback', $text);
    // Double quotes so that \n is an actual newline in the message.
    if ($text === null) exit("PREG Error!\nTarget string too big.");
    return $text;
}

// The callback function is called once for each
// match found and is passed one parameter: $matches.
function _my_callback($matches)
{ // Either $1 or $2 matched, but never both.
    if ($matches[1]) {
        return $matches[1];
    }
    // Collapse multiple tabs and spaces into a single space.
    $matches[2] = preg_replace('/[ \t][ \t]++/S', ' ', $matches[2]);
    // Trim each line
    $matches[2] = preg_replace('/^ /m', '', $matches[2]);
    $matches[2] = preg_replace('/ $/m', '', $matches[2]);
    return $matches[2];
}

// Create some test data.
$data = "This  is some  text
  This is also   some text

<code>
User::setup(array(
    'key1'      => 'value1',
    'key2'      => 'value1',
    'key42'     => '<code>
        Pay no attention to this. It has been proven over and
        over again that it is <code>   unpossible   </code>
        to parse nested stuff with regex!           </code>'
));
</code>";

// Demonstrate that it works on one small test string.
echo("BEFORE:\n". $data ."\n\n");
echo("AFTER:\n". clean_non_code($data) ."\n\nTesting...");

// Build a large test string.
$bigdata = '';
for ($i =   0; $i < 30000; ++$i) $bigdata .= $data;
$size = strlen($bigdata);

// Measure how long it takes to process it.
$time = microtime(true);
$bigdata = clean_non_code($bigdata);
$time = microtime(true) - $time;

// Print benchmark results
printf("Done.\nTest string size: %d bytes. Time: %.3f sec. Speed: %.0f KB/s.\n",
    $size, $time, ($size / $time)/1024.);
?>

Here are the script benchmark results when run on my test box (WinXP32, PHP 5.2.14 cli):

'Test string size: 10410000 bytes. Time: 1.219 sec. Speed: 8337 KB/s.'
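The recursive part of the script's regex can be seen in isolation in a much smaller sketch. The pattern below is a simplified illustration, not the regex from the script: it assumes bare <code> tags without attributes, and it is compared against a plain lazy match to show what recursion adds.

```php
<?php
// A <code> element whose body is plain text, a stray "<" that is not a code tag,
// or another complete <code> element matched by recursing into the whole pattern.
$recursive = '~<code>(?:[^<]++|<(?!/?code>)|(?R))*+</code>~i';

// Non-recursive version for comparison: stops at the first closing tag.
$lazy = '~<code>.*?</code>~is';

$text = 'a <code>x <code>y</code> z</code> b';

preg_match($recursive, $text, $m);
preg_match($lazy, $text, $n);

echo $m[0], "\n"; // <code>x <code>y</code> z</code>
echo $n[0], "\n"; // <code>x <code>y</code>
```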

Note that this solution does not handle CODE tags having <> angle brackets in their attributes (probably a very rare edge case), but the regex could be easily modified to handle these as well. Note also that the maximum string length will depend upon the composition of the string content (i.e. big CODE blocks reduce the maximum input string length).
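Since the input size that works depends on the content, it can help to report why a call failed rather than only that it returned null. The wrapper below is a hypothetical helper, not part of the script above; it relies only on preg_replace_callback() returning null on error and on preg_last_error() giving the error code (limits such as pcre.backtrack_limit and pcre.recursion_limit are among the possible causes).

```php
<?php
// Hypothetical wrapper: run preg_replace_callback() and fail loudly
// instead of silently continuing with a null result.
function replace_or_fail(string $re, callable $callback, string $text): string
{
    $result = preg_replace_callback($re, $callback, $text);
    if ($result === null) {
        // preg_last_error() returns one of the PREG_*_ERROR constants.
        throw new RuntimeException('PCRE error code: ' . preg_last_error());
    }
    return $result;
}

// Small usage example with a trivial callback.
$upper = function (array $m): string {
    return strtoupper($m[0]);
};
echo replace_or_fail('/b+/', $upper, 'abba'), "\n"; // aBBa
```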

P.S. Note to SO staff: the <!-- language: lang-none --> doesn't work.

What you will want is to parse it using some form of HTML parser.

For example, you could use DOMDocument to iterate through all elements, ignoring code elements, and strip whitespace from the text nodes of the rest.
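A minimal sketch of that idea, assuming the DOM extension is available and the input is reasonably well-formed markup. The function name and the wrapper <div> are inventions of this example, and only indentation after a newline and runs of spaces or tabs are touched; trailing spaces are left alone so that spacing next to inline elements is not damaged.

```php
<?php
// Sketch: collapse whitespace in text nodes that are not inside a <code> element.
function collapse_outside_code(string $html): string
{
    $doc = new DOMDocument();
    // The wrapper <div> is an assumption of this sketch: it gives the fragment one root.
    $doc->loadHTML(
        '<div>' . $html . '</div>',
        LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD
    );

    $xpath = new DOMXPath($doc);
    // Select every text node that has no <code> ancestor.
    foreach ($xpath->query('//text()[not(ancestor::code)]') as $node) {
        $value = preg_replace('/\n[ \t]+/', "\n", $node->data); // drop indentation
        $value = preg_replace('/[ \t]{2,}/', ' ', $value);      // collapse runs
        $node->data = $value;
    }

    // Serialize the children of the wrapper, leaving the wrapper itself out.
    $out = '';
    foreach ($doc->documentElement->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

$sample = "This  is some  text\n  This is also   some text\n\n<code>a    b</code>";
echo collapse_outside_code($sample), "\n";
```

Note that serializing through the DOM escapes characters such as > inside text, so a docblock containing PHP code like => may not round-trip byte for byte.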

Alternatively, read the file with file() so you have an array of lines (fopen() alone only gives you a file handle), and step through each line, stripping whitespace if it is outside of a code element.

To determine if you are in a code element, look for the starting tag <code> and set a flag which says "in code element" mode. You can then skip these lines. Reset the flag when you encounter </code>. You could take nesting into account by storing the state as an integer depth, even though nested code elements are not the smartest idea (why would you nest them?).
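The flag approach, with the integer depth for nesting, could look roughly like this. It is a sketch under stated assumptions: the function name is made up, tags are bare <code> and </code> without attributes, lines arrive without trailing newlines, and any line that contains a code tag is left untouched.

```php
<?php
// Sketch: strip extra spaces from lines that are outside <code> elements.
function strip_outside_code(array $lines): array
{
    $depth = 0; // how many <code> elements are currently open
    $out   = [];

    foreach ($lines as $line) {
        $lower  = strtolower($line);
        $opens  = substr_count($lower, '<code>');
        $closes = substr_count($lower, '</code>');

        // Only touch lines that are fully outside code and contain no code tag.
        if ($depth === 0 && $opens === 0 && $closes === 0) {
            $line = preg_replace('/ {2,}/', ' ', trim($line, " \t"));
        }

        // Update the nesting depth after handling the line; never go below zero.
        $depth = max(0, $depth + $opens - $closes);
        $out[] = $line;
    }
    return $out;
}

$lines = [
    'This  is some  text',
    '  This is also   some text',
    '',
    '<code>',
    'User::setup(array(',
    "    'key1' => 'value1',",
    '));',
    '</code>',
    'after   code',
];
echo implode("\n", strip_outside_code($lines)), "\n";
```

When reading from disk, file($path, FILE_IGNORE_NEW_LINES) produces lines in the form this function expects.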

Mario came up with this before me.

Parsing HTML with regexes is a bad idea.

RegEx match open tags except XHTML self-contained tags

Use something like Zend_DOM to parse the HTML and extract the parts of it in which you need to replace spaces.

