What regular expression pattern do I need to match everything between {{ and }}

I'm trying to parse Wikipedia, but I'm ending up with orphaned }} after running the regex code. Here's my PHP script.

<?php

$articleName='england';

$url = "http://en.wikipedia.org/wiki/Special:Export/" . $articleName;
ini_set('user_agent','custom agent'); //required so that Wikipedia allows our request.

$feed = file_get_contents($url);
$xml = new SimpleXmlElement($feed);

$wikicode = $xml->page->revision->text;



$wikicode=str_replace("[[", "", $wikicode);
$wikicode=str_replace("]]", "", $wikicode);
$wikicode=preg_replace('/\{\{([^}]*(?:\}[^}]+)*)\}\}/','',$wikicode);

print($wikicode);

?>

I think the problem is that I have nested {{ and }}, e.g.:

{{ something {{ something else {{ something new }}{{ something old }} something blue }} something green }}

You can use:

\{\{(.*?)\}\}

Most regex flavors treat the brace { as a literal character, unless it is part of a repetition operator like {x,y}, which is not the case here. So you do not need to escape it with a backslash, though doing so gives the same result.

So you can also use:

{{(.*?)}}

Sample:

$ echo {{StackOverflow}} | perl -pe 's/{{(.*?)}}/$1/'
StackOverflow
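
Since the question uses PHP, here is a minimal sketch of the same check with preg_replace. It assumes PHP's PCRE engine, where a brace that does not form a valid {min,max} quantifier is read as a literal; the variable names are only illustrative.

```php
<?php
// Escaped and unescaped braces are expected to behave identically here,
// because neither {{ nor }} forms a valid {min,max} quantifier.
$input = '{{StackOverflow}}';

$escaped   = preg_replace('/\{\{(.*?)\}\}/', '$1', $input);
$unescaped = preg_replace('/{{(.*?)}}/', '$1', $input);

echo $escaped, "\n";    // StackOverflow
echo $unescaped, "\n";  // StackOverflow
```

Escaping the braces anyway is a reasonable habit if the pattern may later be moved to a stricter regex flavor.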

Also note that the . (which matches any character other than newline) is quantified here with the non-greedy *? instead of the greedy * , so it tries to match as little as possible.

Example:

In the string '{{stack}}{{overflow}}' it will capture 'stack' and not 'stack}}{{overflow' .
If you want the latter behavior, you can change .*? to .* , making the match greedy.
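
The difference can be checked directly in PHP. This is a small sketch using the example string above; the variable names $lazy and $greedy are illustrative.

```php
<?php
// Compare lazy and greedy quantifiers on the example string from the text.
$input = '{{stack}}{{overflow}}';

preg_match('/\{\{(.*?)\}\}/', $input, $lazy);   // lazy: stops at the first }}
preg_match('/\{\{(.*)\}\}/', $input, $greedy);  // greedy: runs to the last }}

echo $lazy[1], "\n";    // stack
echo $greedy[1], "\n";  // stack}}{{overflow
```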

Your edit shows that you're trying to do a recursive match, which is very different from the original question. If you weren't just deleting the matched text I would advise you not to use regexes at all, but this should do what you want:

$wikicode=preg_replace('~{{(?:(?:(?!{{|}}).)++|(?R))*+}}~s',
                       '', $wikicode);

After the first {{ matches an opening delimiter, (?:(?!{{|}}).)++ gobbles up everything until the next delimiter. If it's another opening delimiter, the (?R) takes over and applies the whole regex again, recursively.

(?R) is about as non-standard as regex features get. It comes from the PCRE library, which is what powers PHP's regex flavor, and Perl 5.10 and later support it as well. Some other flavors have their own ways of matching recursive structures, all of them very different from each other.
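
To see why recursion matters for nested templates, the following sketch runs a flat lazy pattern and the recursive pattern over the same input. The input string and variable names are made up for illustration; the recursive pattern is the one given above.

```php
<?php
// A nested template, similar in shape to the example in the question.
$input = 'A{{ a {{ b }} c }}B';

// Flat lazy match: ends at the first }}, so the outer closing }} is orphaned.
$flat = preg_replace('/\{\{.*?\}\}/s', '', $input);

// Recursive match: (?R) re-enters the whole pattern at each nested {{.
$recursive = preg_replace('~{{(?:(?:(?!{{|}}).)++|(?R))*+}}~s', '', $input);

echo $flat, "\n";       // A c }}B
echo $recursive, "\n";  // AB
```

The flat result shows the orphaned }} described in the question, while the recursive pattern removes the whole balanced block.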

\{{2}(.*)\}{2} or, cleaner, with lookarounds (?<=\{{2}).*(?=\}{2}) , but only if your regex engine supports them.

If you want your match to stop at the first }} found (i.e. non-greedy), you should replace .* with .*? .

You should also take into account your engine's single-line (dot-all) setting, since in some engines . does not match newline characters by default. You can either enable single-line mode or use [\s\S]* instead of .* . (A class such as [.\r\n] would not help, because inside a character class the dot is a literal dot.)
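
In PHP, single-line mode is the s pattern modifier. The sketch below compares three variants on a template that spans two lines; the class [\s\S] is used here as one common way to say "any character, newlines included", and the variable names are illustrative.

```php
<?php
// A template whose body spans two lines.
$input = "{{line one\nline two}}";

$noFlag    = preg_match('/\{\{(.*?)\}\}/', $input);       // . stops at the newline
$dotAll    = preg_match('/\{\{(.*?)\}\}/s', $input);      // s modifier: . matches newlines too
$charClass = preg_match('/\{\{([\s\S]*?)\}\}/', $input);  // class that matches any character

var_dump($noFlag);     // int(0)
var_dump($dotAll);     // int(1)
var_dump($charClass);  // int(1)
```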

Besides using the already mentioned non-greedy quantifier, you can also use this:

\{\{(([^}]|}[^}])*)}}

The inner ([^}]|}[^}])* matches only sequences of zero or more arbitrary characters that do not contain the sequence }} .

A greedy version that still finds the shortest match is:

\{\{([^}]*(?:\}[^}]+)*)\}\}

(For comparison, with the string {{fd}sdfd}sf}x{dsf}} , the lazy version \{\{(.*?)\}\} takes 57 steps to match, while my version takes only 17 steps. This assumes the debug output of RegexBuddy can be trusted.)
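
As a quick sanity check, the lazy pattern and the two negated-class patterns above can be run against the same test string in PHP. This sketch only compares what each pattern captures; it does not try to reproduce the step counts, and the array keys are illustrative labels.

```php
<?php
// The test string used in the comparison above.
$input = '{{fd}sdfd}sf}x{dsf}}';

$patterns = [
    'lazy'        => '/\{\{(.*?)\}\}/',
    'alternation' => '/\{\{(([^}]|}[^}])*)}}/',
    'unrolled'    => '/\{\{([^}]*(?:\}[^}]+)*)\}\}/',
];

$captures = [];
foreach ($patterns as $name => $pattern) {
    preg_match($pattern, $input, $m);
    $captures[$name] = $m[1];          // group 1 holds the text between {{ and }}
    echo $name, ': ', $m[1], "\n";     // each line ends with fd}sdfd}sf}x{dsf
}
```

All three capture the same text on flat input. None of them handles nested {{ }} pairs, which is where the recursive pattern is needed.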
