Regex formula for all text including inside HTML tags, but not in 'a' tags

I am struggling with a formula that replaces text in a string while looking inside all tags except 'a' tags (links).

This is my current formula:

\\b(?<!-)text\\b(?<!<[^<>]*)(?!-|[^<>]*>)

For example, I would like to replace only the instances of 'text' outside of the 'a' tag, i.e. those within the 'p' and 'li' tags:

<p>
    Some text here. <a href="#">Some more text</a>
    <ul>
        <li>Other text</li>
    </ul>
</p>

It must also match whole words only, counting dashes as part of a word, which the formula currently does successfully. It also (I believe) doesn't replace anything within the tags themselves, i.e.:

<li class="text">here some text</li>

It would not replace the 'text' in the class name.

This formula should do what you want:

(?:\A|<\/a>)(?:[^<]|<[^a])*?(text)

The matched word (text in this case) is stored in group 1 of the match. Note, however, that this fails if, for example, there are links nested inside a link, as mentioned in the comments, or if the document contains unusual strings. Also note that it only matches the first occurrence in each area outside of a link, so if you want to replace every occurrence you'll have to run the expression multiple times. If there are at most 2 occurrences of text in such an area, you can combine this formula with the following one:

(?:\A|<\/a>)(?:[^<]|<[^a])*(text)

which matches the last occurrence of text in that area.
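To make the difference between the two formulas concrete, here is a minimal sketch in Python (my choice of language; the question does not name one). The DEMO string and the helper group_spans are made up for illustration. It only reports where group 1 lands for each formula.

```python
import re

# The two formulas from above: lazy (first occurrence) and greedy (last occurrence).
FIRST = re.compile(r"(?:\A|<\/a>)(?:[^<]|<[^a])*?(text)")
LAST = re.compile(r"(?:\A|<\/a>)(?:[^<]|<[^a])*(text)")

# Hypothetical input: two occurrences before the link, one inside it, one after it.
DEMO = 'text one, text two <a href="#">link text</a> tail text'


def group_spans(pattern, html):
    """Return the (start, end) span of group 1 for every match of pattern."""
    return [m.span(1) for m in pattern.finditer(html)]


first_spans = group_spans(FIRST, DEMO)
last_spans = group_spans(LAST, DEMO)
print(first_spans)
print(last_spans)
```

Each formula yields one match per area outside a link. They differ only in the area before the link, where the lazy one picks the first occurrence and the greedy one the last; the occurrence inside the link is never reported.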

I'm aware this isn't an optimal solution, so I'm open to suggestions for improvement. However, I'm not sure whether regex is the best tool here (as Yunnosch pointed out), since regular expressions have only limited power (they correspond, after all, to type-3 grammars).
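If regex alone turns out to be too weak, one alternative is to let an HTML parser track whether we are inside a link and to apply a small regex only to the text nodes outside of it. The following is a rough sketch of that idea using Python's standard html.parser module. The class and function names are invented for illustration, and the word pattern reuses the whole-word and dash checks from the question. It is not a full HTML serializer: end tags are re-emitted in lowercase, and text inside script or style elements is treated like any other text.

```python
import re
from html.parser import HTMLParser


class OutsideLinkReplacer(HTMLParser):
    """Re-emits the document, replacing a word only in text outside <a>...</a>."""

    def __init__(self, word, replacement):
        # convert_charrefs=False keeps entities such as &amp; as they were written.
        super().__init__(convert_charrefs=False)
        # Whole word, not directly preceded or followed by a dash.
        self.pattern = re.compile(r"\b(?<!-)" + re.escape(word) + r"\b(?!-)")
        self.replacement = replacement
        self.link_depth = 0  # greater than 0 while inside at least one <a>
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.link_depth += 1
        self.out.append(self.get_starttag_text())  # raw tag, attributes untouched

    def handle_startendtag(self, tag, attrs):
        self.out.append(self.get_starttag_text())  # e.g. <br/>

    def handle_endtag(self, tag):
        if tag == "a" and self.link_depth:
            self.link_depth -= 1
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if self.link_depth == 0:
            data = self.pattern.sub(lambda m: self.replacement, data)
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")


def replace_outside_links(html, word, replacement):
    parser = OutsideLinkReplacer(word, replacement)
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


print(replace_outside_links('<li class="text">here some text</li>', "text", "word"))
```

Because attributes are copied verbatim from the raw start tag, the class name in the example stays as it is and only the text node changes.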

Let me explain the formula:

  • (?:\A|<\/a>) matches either the start of the input or the end of a link
  • (?:[^<]|<[^a])* matches any character that does not open a tag, or, if it does open a tag, one that is at least not the start of a link (the '<' is not followed by 'a')
  • ? makes the preceding * match as little as possible (hence the first match)
  • (text) matches the actual text and saves it in group 1
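The same breakdown can be written directly into the pattern. Below is a small sketch using Python's re.VERBOSE flag (Python is my assumption, and the names ANNOTATED and COMPACT are invented), where whitespace and comments inside the pattern are ignored. It also illustrates one consequence of the second part: since it rejects any '<' followed by 'a', tags such as <abbr> stop the match just like a link does.

```python
import re

COMPACT = re.compile(r"(?:\A|<\/a>)(?:[^<]|<[^a])*?(text)")

ANNOTATED = re.compile(
    r"""
    (?: \A | <\/a> )        # the start of the input, or the end of a link
    (?: [^<] | <[^a] )*?    # any char except '<', or '<' not followed by 'a'; lazy
    (text)                  # the actual text, saved in group 1
    """,
    re.VERBOSE,
)

html = '<p>Some text here. <a href="#">Some more text</a></p>'
print(ANNOTATED.search(html).span(1) == COMPACT.search(html).span(1))

# '<abbr' also starts with '<a', so nothing after it is reachable from \A.
print(ANNOTATED.search("<abbr>HTML</abbr> text"))
```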

You can access the groups by using, for example, $1 or \1, depending on your environment.
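As an illustration of both group access and "running the expression multiple times", here is a hedged sketch in Python, where a match object exposes groups via group(n). One change to the formula is my own assumption: the prefix gets its own capturing group so it can be written back unchanged, which makes the word group 2 instead of group 1. The function name replace_repeatedly is hypothetical.

```python
import re

# Assumption: the formula from above with an extra group around the prefix.
# Group 1 = everything before the word, group 2 = the word itself.
PATTERN = re.compile(r"((?:\A|<\/a>)(?:[^<]|<[^a])*?)(text)")


def replace_repeatedly(html, replacement):
    """Replace the first occurrence per area on each pass until none are left."""
    if "text" in replacement:
        # Otherwise every pass would find the word again and never finish.
        raise ValueError("replacement must not contain the search word")
    while True:
        html, count = PATTERN.subn(lambda m: m.group(1) + replacement, html)
        if count == 0:
            return html


html = 'text and text <a href="#">text</a> text'
match = PATTERN.search(html)
print(match.group(2), match.span(2))
print(replace_repeatedly(html, "word"))
```

Each pass rewrites at most one occurrence per area outside a link, so the loop needs as many passes as the busiest area has occurrences, plus one final pass that finds nothing.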
