RegEx to find a quoted string, but not inside HTML tags

Question

I have been looking for a regular expression that identifies a quoted string in the content of an HTML page, but NOT when the quotes are part of the attributes of HTML tags.

Example:

<p id="123">This is some "quoted text" in a <span class="test">sentence.</span></p>

In the line above, I want to find the "quoted text" string, but not id="123" or class="test".

I have tried a few, but none of them work.

The following regex matches the HTML tags in the above example and excludes the sentence content, but I want it to do the opposite:

<[^>]+>

Answer 1

If you want to parse HTML to get useful things out of it, use HTMLAgilityPack; it makes tasks like this fairly straightforward.

See also: You can't use regexes to parse HTML
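To illustrate the parser-first idea, here is a minimal sketch. HTMLAgilityPack is a .NET library; this example swaps in Python's standard `html.parser` module as a stand-in, so it is not the API the answer refers to. The names `TextCollector` and `quoted_in_text` are made up for this illustration. The parser hands over only text nodes, so attribute values never reach the quote search.

```python
import re
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collects text nodes only; tag names and attributes are never passed to handle_data."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def quoted_in_text(html):
    """Return the contents of double-quoted strings found in the page text."""
    parser = TextCollector()
    parser.feed(html)
    parser.close()
    # Join the text nodes so a quotation that spans an inline tag is still found.
    text = "".join(parser.chunks)
    return re.findall(r'"([^"]+)"', text)


sample = '<p id="123">This is some "quoted text" in a <span class="test">sentence.</span></p>'
print(quoted_in_text(sample))  # ['quoted text']
```

Because the text nodes are joined before searching, a quotation that wraps an inline element is also matched, which may or may not be what you want.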

Answer 2

In this particular context, I don't think you're going to have many guarantees. There are too many ways that quoted strings can be put together within a snippet of HTML. However, based on the specific example you gave above, the following expression would find "quoted text":

(?<=(?:^|>)[^<>]*)"[^"]+"(?=[^<>]*(?:<|$))
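The expression above relies on a variable-width lookbehind, which the .NET regex engine accepts but Python's built-in `re` module rejects. As a rough sketch of the same goal in Python, the example below uses a different technique: an alternation that consumes whole tags first (reusing the `<[^>]+>` pattern from the question) and captures a quoted string only when no tag matched. The name `quoted_outside_tags` is hypothetical, and the sketch assumes attribute values never contain a literal `>`.

```python
import re

# First alternative swallows an entire tag, including any quoted attributes.
# Second alternative captures a quoted string that sits outside any tag.
TAG_OR_QUOTE = re.compile(r'<[^>]+>|"([^"]+)"')


def quoted_outside_tags(html):
    """Return quoted strings found in text content, skipping tag attributes."""
    return [
        m.group(1)
        for m in TAG_OR_QUOTE.finditer(html)
        if m.group(1) is not None  # None means the match was a tag
    ]


sample = '<p id="123">This is some "quoted text" in a <span class="test">sentence.</span></p>'
print(quoted_outside_tags(sample))  # ['quoted text']
```

As with the original expression, this only holds for simple markup; comments, scripts, or a `>` inside an attribute value would need a real HTML parser.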

Answer 1
3 2013-03-19 14:59:02

Answer 2
0 2013-03-19 15:05:39
