Regex: match a substring in all lines, except when the substring is inside a comment section

Question

Here I go:

I'm coding a PHP application, and I've got a new official domain for it, where all the FAQs are now located. Some of the files in my script include help links to the old FAQ domain, so I want to replace them with the new domain. However, I want to keep the URLs that point to the old domain when they sit in a line comment or a comment block (I still use the old domain for self-reference and other documentation).

So, basically, what I want is a regular expression that behaves as follows:

- Match all occurrences of example.com in all lines (with the exception below).
- Don't match the entire line, only the example.com string.
- If the line starts with //, /*, or " *", don't match any example.com instance in that line (although this might be a problem if a comment block is closed on the same line where it was opened).

I usually write my block comments like this:

/* text
 * blah 
 * blah
*/

That's why I don't want to match "example.com" if it comes after //, /*, or " *".

I figured it would be something like this:

^(?:(?!//|/\*|\s\*).?).*example\.com

But this has one issue: it matches the whole line instead of only "example.com" (this causes problems mainly when two or more "example.com" strings occur in a single line).

Can someone please help me fix my regex? Please note: it doesn't have to be a PHP regex, since I could always use a tool like grepWin to edit all the files locally at once.

Oh, and please let me know if there's a way to generalize block comments, like this: once /* is found, do not match example.com until */ is found. That would be extremely useful. Is it possible to achieve this with general (non-language-dependent) regular expressions?

Answer 1

This regex matches example.com only when it is not inside a block comment. It does not care about line comments (//), so you'd have to handle those separately:

$result = preg_replace(
    '%example\.com # Match example.com
    (?!            # only if it\'s not possible to match
     (?:           # the following:
      (?!/\*)      #  (unless an opening comment starts first)
      .            #  any character
     )*            # any number of times
     \*/           # followed by a closing comment.
    )              # End of lookahead
    %sx', 
    'newdomain.com', $subject);
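
To see what the lookahead does and does not cover, here is a minimal sketch that runs the same pattern (written on one line, without the x flag) over a small made-up input. The sample text and the variable names are assumptions for illustration only.

```php
<?php
// Same pattern as above, condensed: match example.com unless a closing "*/"
// can be reached without first passing an opening "/*".
$pattern = '%example\.com(?!(?:(?!/\*).)*\*/)%s';

// Hypothetical sample input.
$subject = <<<'SRC'
/* old docs: example.com
 * mirror: example.com
 */
$faq = 'http://example.com/faq';
// see example.com
SRC;

$result = preg_replace($pattern, 'newdomain.com', $subject);
echo $result;
```

Both occurrences inside the block comment are left alone and the one in the string is rewritten. The occurrence in the // line is rewritten too, which is the line-comment limitation mentioned above.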

Answer 2

I would use some kind of tokenizer to tell comments and other language tokens apart.

As you're processing PHP files, you should use PHP's own tokenizer function token_get_all:

$tokens = token_get_all($source);

Then you can iterate over the tokens and distinguish them by their type:

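foreach ($tokens as &$token) {
    // Single-character tokens such as ";" are plain strings, not arrays.
    if (!is_array($token)) {
        continue;
    }
    // T_ML_COMMENT only existed in PHP 4; later versions report T_COMMENT.
    if (in_array($token[0], array(T_COMMENT, T_DOC_COMMENT), true)) {
        // comment: leave untouched
    } else {
        // not a comment
        $token[1] = str_replace('example.com', 'example.net', $token[1]);
    }
}
unset($token); // break the reference left over from the loop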

At the end, put everything back together by concatenating the text of each token (element 1 of an array token, or the token itself when it is a plain string), for example with implode.

For other languages, where you don't have a proper tokenizer at hand, you can write your own little tokenizer:
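
Put together, the tokenizer approach might look like the sketch below. The function name replaceOutsideComments and the sample source are hypothetical, and the code assumes PHP 7.1 or later with the tokenizer extension enabled.

```php
<?php
// Hypothetical helper: replaces $old with $new in every token that is not a comment.
function replaceOutsideComments(string $source, string $old, string $new): string
{
    $out = '';
    foreach (token_get_all($source) as $token) {
        if (!is_array($token)) {
            // Single-character tokens such as ";" or "=" are plain strings.
            $out .= $token;
            continue;
        }
        [$id, $text] = $token;
        $isComment = in_array($id, [T_COMMENT, T_DOC_COMMENT], true);
        $out .= $isComment ? $text : str_replace($old, $new, $text);
    }
    return $out;
}

// Made-up input for illustration.
$source = <<<'SRC'
<?php
// Docs: http://example.com/faq
/* see example.com */
$help = 'http://example.com/faq';
SRC;

$rewritten = replaceOutsideComments($source, 'example.com', 'example.net');
echo $rewritten;
```

Only the string literal on the last line changes; both comments keep the old domain.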

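// Alternatives: block comment | line comment | (example.com) | any other single character.
// The s flag lets "." match newlines, so block comments may span lines and no text is dropped.
preg_match_all('~/\*.*?\*/|//[^\r\n]*|(example\.com)|.~s', $code, $tokens, PREG_SET_ORDER);
foreach ($tokens as &$token) {
    // Group 1 is only present when example.com matched outside a comment.
    if (isset($token[1])) {
        $token = str_replace('example.com', 'example.net', $token[1]);
    } else {
        $token = $token[0];
    }
}
unset($token); // break the reference left over from the loop
$code = implode('', $tokens);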

Note that this does not take any other tokens, such as strings, into account. So it won't replace example.com when it appears inside a string in something that merely looks like a comment, such as:

'foo /* not a comment example.com */ bar'
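
One way to narrow that gap is to let the hand-written tokenizer recognise quoted strings as well, so that comment markers inside a string are not mistaken for comments. The following is only a sketch under assumptions: it handles /* */ and // comments plus single- and double-quoted strings with backslash escapes, and ignores heredocs, # comments, and anything else a real lexer would know about. The sample input is made up.

```php
<?php
// Alternatives: block comment | line comment | (quoted string) | (example.com).
// The nowdoc keeps the pattern free of PHP-level escaping.
$pattern = <<<'REGEX'
~/\*.*?\*/|//[^\r\n]*|('(?:[^'\\]|\\.)*'|"(?:[^"\\]|\\.)*")|(example\.com)~s
REGEX;

$code = <<<'SRC'
// docs at example.com
$a = 'foo /* not a comment example.com */ bar';
/* example.com */
$b = "http://example.com/faq";
SRC;

$fixed = preg_replace_callback($pattern, function (array $m): string {
    if (isset($m[2])) {
        return 'example.net'; // bare occurrence outside strings and comments
    }
    if (isset($m[1])) {
        // String literal: replace inside it, whatever it contains.
        return str_replace('example.com', 'example.net', $m[1]);
    }
    return $m[0]; // comment: keep unchanged
}, $code);

echo $fixed;
```

Because the string alternative is tried at the opening quote, the /* inside the first string never gets a chance to start a comment match, while the two real comments are returned untouched.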
