Regex excluding matches contained within an HTML tag

I'm trying to create a regex to match content within an HTML document, but I wish to exclude matches contained within a tag itself. Consider the following:

<p>Here is some sample text for my widgets</p>
<a href="http://mywidgets.nowhere">Click here to view my widgets</a>

I would like to match 'widgets' so that I can replace it with a different string, say 'green box', without replacing the match within the URL.

Matching 'widgets' is easy enough, but I'm struggling to add the exclusion for when 'widgets' appears inside a tag, between the opening '<' and the closing '>'.

My current workings: as a first step I have started to match 'widgets' contained within '<>' (I can then move on to make this an exclusion later). However, the pattern below seems to match the whole document, even though I have placed an exclusion on the closing > to make sure 'widgets' appears within a tag.

<.*[^>]widgets.*[^<]>+ 

It's probably down to lazy / greedy matching, but I can't quite work it out!
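To make the greedy behaviour concrete, here is a minimal Python sketch that runs the pattern from the question against the sample markup. The `[^>]` only constrains the single character right before 'widgets', while each `.*` is free to run across any number of `<` and `>` characters. The second pattern, `<[^>]*widgets[^>]*>`, is a hypothetical tag-confined variant added for comparison; it does not come from the thread.

```python
import re

html = (
    '<p>Here is some sample text for my widgets</p>\n'
    '<a href="http://mywidgets.nowhere">Click here to view my widgets</a>'
)

# The pattern from the question, unchanged. "." does not cross newlines by
# default, so each line is searched separately.
question_pattern = re.compile(r'<.*[^>]widgets.*[^<]>+')
matches = question_pattern.findall(html)
for m in matches:
    print(repr(m))  # each match swallows an entire line

# Hypothetical variant (an assumption, not from the thread): replacing ".*"
# with "[^>]*" keeps the match from running past the end of the tag.
tag_only = re.findall(r'<[^>]*widgets[^>]*>', html)
print(tag_only)
```

With the `re.DOTALL` flag, where `.` also matches newlines, the first pattern would be expected to span both lines at once, which fits the "matches the whole document" symptom.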

Overview

This is by no means a great answer, since it parses HTML with regex, but it does work for the test case given by the OP.

See "RegEx match open tags except XHTML self-contained tags" for more information.


Code

See regex in use here

(?<!<[^>]*)widgets

Explanation

  • (?<!<[^>]*) Negative lookbehind ensuring that what precedes is not < followed by any number of characters other than >, i.e. that the match does not sit inside a tag that is still open
  • widgets Matches widgets literally
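The lookbehind above has variable length, which only some engines accept (.NET does; Python's built-in `re` module rejects it with a "look-behind requires fixed-width pattern" error). The sketch below therefore swaps in a different technique with the same intent: a negative lookahead that rejects 'widgets' when a `>` appears before the next `<`. It is an illustration under that assumption, not the pattern from the answer, and the variable names are made up for the example.

```python
import re

html = (
    '<p>Here is some sample text for my widgets</p>\n'
    '<a href="http://mywidgets.nowhere">Click here to view my widgets</a>'
)

# Lookahead variant (not the answer's pattern): skip "widgets" when a ">"
# follows before any "<", which suggests we are still inside a tag.
outside_tags = re.compile(r'widgets(?![^<]*>)')

result = outside_tags.sub('green box', html)
print(result)
```

Like the lookbehind, this is a heuristic: a stray `>` in ordinary text (for example "widgets > gadgets") would cause the preceding 'widgets' to be skipped.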

This may partially work:

(?:^|>)[^<]*widgets

This will start looking from the start of a line (if the /m flag is used) or from the end of a tag (so we know we are not inside one), and advance over as many characters as possible that are not <, meaning another tag cannot open before widgets is found. The issues with this are that it may give weird results if you have a > inside a tag (e.g. in JavaScript) or if a single tag spans multiple lines, and that it won't find several instances of "widgets" in the same substring. To solve those issues, you'd better use an actual XML parser, as advised by ctwheels.
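As a rough illustration of the parser route, the sketch below uses Python's standard `html.parser` module to re-emit the markup unchanged while replacing text only in the data between tags. The class name `TextReplacer` and the helper `replace_in_text` are hypothetical names chosen for this example. It assumes reasonably well-formed input, and it treats the contents of `<script>` and `<style>` elements as ordinary text, so those would need extra handling in real use.

```python
from html.parser import HTMLParser


class TextReplacer(HTMLParser):
    """Rebuilds the document, replacing `old` with `new` in text nodes only."""

    def __init__(self, old, new):
        # Keep entities such as &amp; as written instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.old = old
        self.new = new
        self.out = []

    def handle_starttag(self, tag, attrs):
        # Emit the tag exactly as it appeared, attributes untouched.
        self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        # Only text between tags reaches this method.
        self.out.append(data.replace(self.old, self.new))

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")


def replace_in_text(html, old, new):
    parser = TextReplacer(old, new)
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


html = (
    '<p>Here is some sample text for my widgets</p>\n'
    '<a href="http://mywidgets.nowhere">Click here to view my widgets</a>'
)
print(replace_in_text(html, 'widgets', 'green box'))
```

Because the replacement runs on each text node as a whole, several occurrences of "widgets" in the same run of text are all replaced, which the regex above would not do in a single pass.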
