Match pattern within pattern

I'm trying to match any bracketed items within <sup> tags.

My regular expression is being too greedy: the match starts at the first <sup> tag and ends at the last </sup> tag.

/<sup\b[^>]*>(.*?)\[(.*?)\](.*?)<\/sup>/

Example HTML:

<sup>[this should be gone]</sup>
<sup>but this should stay</sup>
<sup>this should [ also stay</sup>
[and this as well]
<sup><a href="#">[but this should definitely go]</a></sup>

Any idea why?

Thanks!

EDIT: I suppose these answers make sense. I've got much of the HTML parsed without regex; I just figured that this particular example would work with regex because it would do the following:

  1. see the first <sup> tag
  2. find the first instance of </sup>
  3. search the inside for (wild)(bracket)(wild)(closing bracket)(wild)
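
The steps above are not what the pattern actually does. A lazy `.*?` only prefers the shortest match from a given starting point; nothing stops it from running past a `</sup>` while it looks for the next bracket. Here is a minimal sketch of that behavior in Python (chosen for illustration, since the question's host language is not shown). It uses the pattern from the question on a one-line input; `first_match` and the sample string are hypothetical names and data, not from the original post.

```python
import re

# The pattern from the question, written as a Python raw string.
PATTERN = re.compile(r"<sup\b[^>]*>(.*?)\[(.*?)\](.*?)</sup>")


def first_match(html):
    """Return the text of the first match, or None if nothing matches."""
    m = PATTERN.search(html)
    return m.group(0) if m else None


# The only bracketed item sits BETWEEN two <sup> elements, yet the
# pattern still matches: the first lazy group crosses the first </sup>.
sample = "<sup>but this should stay</sup> [and this as well] <sup>x</sup>"
print(first_match(sample) == sample)
```

The first group ends up holding `but this should stay</sup> `, which shows that step 2 ("find the first instance of </sup>") never constrains where the bracket search happens.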

You really can't do this. It's impossible to parse HTML with regular expressions, because regular expressions can only match regular languages, which are a simpler subset of the languages we actually use. One very common non-regular language is the Dyck language of balanced brackets; it's impossible to match correctly nested parentheses with regular expressions. And HTML, if you think about it, is the same as this, with tags replacing parentheses. Thus, (a) matching correctly nested sup tags is impossible, and (b) matching balanced brackets is impossible.

I don't use PHP myself, but I know it has access to an HTML DOM; I'd recommend using that instead. Then filter through that for every sup tag and check each one's inner text. If you only want to catch tags whose inner text is just [...], where the ... does not contain square brackets, you can use ^\[[^\]]+\]$ as your regex (double the backslashes if you write it inside a string literal that requires it); if you want real nesting, more complicated checking is necessary.
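
The approach in this answer (walk the parsed document, then test each sup element's inner text) can be sketched roughly as follows. This is an illustration in Python using the standard-library `html.parser` module rather than PHP's DOM, and `SupTextCollector`, `sup_inner_texts`, and `BRACKETED` are hypothetical names. It assumes reasonably well-formed markup.

```python
import re
from html.parser import HTMLParser

# Inner text that is exactly one bracketed item with no "]" inside it.
BRACKETED = re.compile(r"^\[[^\]]+\]$")


class SupTextCollector(HTMLParser):
    """Collect the text content of every outermost <sup> element."""

    def __init__(self):
        super().__init__()
        self.depth = 0    # number of <sup> elements currently open
        self.buffer = []  # text pieces of the <sup> being read
        self.texts = []   # finished inner texts, in document order

    def handle_starttag(self, tag, attrs):
        if tag == "sup":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "sup" and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                self.texts.append("".join(self.buffer))
                self.buffer = []

    def handle_data(self, data):
        # Text inside child tags such as <a> is collected as well.
        if self.depth > 0:
            self.buffer.append(data)


def sup_inner_texts(html):
    """Return the inner text of each outermost <sup> element."""
    parser = SupTextCollector()
    parser.feed(html)
    parser.close()
    return parser.texts
```

Calling `BRACKETED.match(text)` on each collected string then tells you which sup elements consist solely of a bracketed item, including the one whose brackets sit inside a nested `<a>` tag.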

If your requirement were specifically to remove any text inside "<sup>[" and "]</sup>", then you would be OK. But judging by your last example, you want to account for a nested tag as well, and probably arbitrary nested tags. So I must remind you...
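
To illustrate the distinction drawn here, the following Python sketch handles only the fixed-delimiter case, where the bracket directly follows `<sup>` and directly precedes `</sup>`. The name `remove_simple_bracketed_sup` is hypothetical, and deleting the whole element (rather than only its text) is an assumption about what "gone" means in the question.

```python
import re

# Matches only <sup>[...]</sup> with nothing else between tag and bracket.
SIMPLE_BRACKETED_SUP = re.compile(r"<sup>\[[^\]]*\]</sup>")


def remove_simple_bracketed_sup(html):
    """Delete sup elements whose entire content is one bracketed item."""
    return SIMPLE_BRACKETED_SUP.sub("", html)
```

On the question's example this removes the first line but leaves the last one untouched, because there the bracket is wrapped in an `<a>` tag. That nested case is the one this answer warns about.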

Don't parse HTML with regex!

Isn't that the normal behavior? Have you specified the ungreedy option for your regexp?

You probably cannot do this with one regular expression. You will need one that replaces using a callback function, which in turn runs a separate regular expression.
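
A rough sketch of that two-stage idea, written in Python for illustration: an outer pattern isolates each sup element, and a callback applies a second pattern to its contents only. The names `SUP_ELEMENT`, `BRACKETED_ITEM`, and `strip_bracketed_in_sup` are hypothetical. The sketch assumes sup elements are never nested inside one another, and it leaves the emptied tags in place.

```python
import re

# Outer pattern: opening tag, contents up to the FIRST </sup>, closing tag.
SUP_ELEMENT = re.compile(r"(<sup\b[^>]*>)(.*?)(</sup>)", re.DOTALL)
# Inner pattern: one bracketed item with no nested brackets.
BRACKETED_ITEM = re.compile(r"\[[^\[\]]*\]")


def strip_bracketed_in_sup(html):
    """Remove bracketed items, but only inside <sup> elements."""

    def clean(match):
        opening, inner, closing = match.groups()
        # The second regex only ever sees the contents of one element.
        return opening + BRACKETED_ITEM.sub("", inner) + closing

    return SUP_ELEMENT.sub(clean, html)
```

Because the inner pattern runs on one element's contents at a time, a bracket outside any sup element, or an unclosed `[` inside one, is left alone.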

The better method, as everyone has mentioned, would be to use a DOM object to parse the HTML first.

Using regexp to parse HTML is usually not a very good idea.

See "Parsing Html The Cthulhu Way".
