RegEx that will find quoted strings but NOT inside HTML tags

I have been looking for a regular expression that identifies a quoted string in the content of an HTML page, but NOT when the quotes are part of an HTML tag's attributes.

Example:

<p id="123">This is some "quoted text" in a <span class="test">sentence.</span></p>

In the above line, I want to find the "quoted text" string but not id="123" or class="test".

I have tried a few expressions, but none of them work.

The following regex matches the HTML tags in the above example and excludes the sentence content, but I want it to do the opposite:

<[^>]+>

If you want to parse HTML to get useful things out of it, use HTMLAgilityPack. It makes tasks like this fairly straightforward.

See also: You can't use regexes to parse HTML
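To illustrate the parser-based idea, here is a minimal sketch. It does not use HTMLAgilityPack (a .NET library); it assumes Python's standard-library `html.parser` as a stand-in, and the names `QuotedTextFinder` and `find_quoted_text` are made up for this example. The point is only that a parser hands you the text nodes separately from the attributes, so a simple quote pattern can be applied to text alone.

```python
import re
from html.parser import HTMLParser


class QuotedTextFinder(HTMLParser):
    """Collects double-quoted strings that occur in text nodes only."""

    def __init__(self):
        super().__init__()
        self.quotes = []

    def handle_data(self, data):
        # handle_data receives the text between tags;
        # attribute values such as id="123" never reach this method.
        self.quotes.extend(re.findall(r'"[^"]+"', data))


def find_quoted_text(html):
    """Hypothetical helper: return quoted strings found outside of tags."""
    finder = QuotedTextFinder()
    finder.feed(html)
    finder.close()
    return finder.quotes


sample = '<p id="123">This is some "quoted text" in a <span class="test">sentence.</span></p>'
print(find_quoted_text(sample))
```

Note that a quotation that spans a nested tag (an opening quote before a `<b>` and the closing quote after it) arrives as separate text chunks and would not be matched by this sketch.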

In this particular context, I don't think you're going to have many guarantees. There are too many ways that quoted strings can be put together within a snippet of HTML. However, based on the specific example you gave above, the following expression would find "quoted text":

(?<=(?:^|>)[^<>]*)"[^"]+"(?=[^<>]*(?:<|$))
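One caveat about the expression above: its lookbehind contains `[^<>]*`, so it has variable length. The .NET regex engine accepts that, but some engines do not; Python's `re` module, for example, requires fixed-width lookbehind. As a rough alternative under that constraint, the sketch below reuses the `<[^>]+>` pattern from the question: it matches a tag first and throws it away, and only captures a quoted string when no tag matched at that position. The names `TAG_OR_QUOTE` and `quoted_outside_tags` are hypothetical.

```python
import re

# Either consume a whole tag (no capture) or capture a quoted string.
# Because the tag alternative comes first, quotes inside tags are swallowed
# together with the tag and never reach the capturing group.
TAG_OR_QUOTE = re.compile(r'<[^>]+>|("[^"]+")')


def quoted_outside_tags(html):
    """Hypothetical helper: return quoted strings that are not inside a tag."""
    return [
        m.group(1)
        for m in TAG_OR_QUOTE.finditer(html)
        if m.group(1) is not None
    ]


sample = '<p id="123">This is some "quoted text" in a <span class="test">sentence.</span></p>'
print(quoted_outside_tags(sample))
```

Like the original expression, this gives no guarantees on arbitrary HTML. An attribute value that contains `>`, or quotes inside comments or scripts, will confuse it.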
