Regular Expression Lookbehind doesn't work as expected

I have a string in .NET:

<p class='p1'>Para 1</p><p>Para 2</p><p class="p2">Para 3</p><p>Para 4</p>

Now, I want to get only the text inside the p tags (Para 1, Para 2, Para 3, Para 4).

I used the following regular expression, but it doesn't give me the expected result:

(?<=<p.*>).*?(?=</p>)

If I use (?<=<p>).*?(?=</p>) instead, it gives Para 2 and Para 4, the two p tags that don't have a class attribute.

I'd like to know what's wrong with (?<=<p.*>).*?(?=</p>).

Let's illustrate this using RegexBuddy:

[RegexBuddy screenshot]

Your regex matches more than you think: the dot matches any character, so it doesn't care about tag boundaries.

What it is actually doing:

  • (?<=<p.*>) : Assert that the character just before the current position is a > and that <p occurs somewhere earlier in the string, with any number of characters in between.
  • .*? : Match any number of characters...
  • (?=</p>) : ...until the next occurrence of </p> .
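
The effect of the unrestricted dot can be seen in isolation. The sketch below is in Python rather than .NET, and because Python's re module rejects variable-width lookbehind, it runs the <p.*> subpattern on its own instead of inside (?<=...). The variable names are only for illustration.

```python
import re

html = ("<p class='p1'>Para 1</p><p>Para 2</p>"
        '<p class="p2">Para 3</p><p>Para 4</p>')

# Greedy dot: runs from the first <p to the LAST > in the string,
# crossing every tag boundary on the way.
greedy = re.search(r"<p.*>", html).group(0)

# Negated class: cannot cross < or >, so it stops at the end of the first tag.
bounded = re.search(r"<p[^<>]*>", html).group(0)

print(greedy == html)  # True
print(bounded)         # <p class='p1'>
```

Inside a lookbehind the same freedom means that any position preceded by a > qualifies, as long as a <p appears somewhere before it.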

Your question is a bit unclear, but if your plan is to find the text within <p> tags regardless of whether they have any attributes, you shouldn't be using regular expressions anyway but a DOM parser, for example the HTML Agility Pack.
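
To give a rough idea of the parser route, here is a minimal sketch. It does not use the HTML Agility Pack; it assumes Python's standard-library html.parser instead, which is an event-based parser rather than a DOM. The class name ParagraphText is made up for this example.

```python
from html.parser import HTMLParser


class ParagraphText(HTMLParser):
    """Collect the text content of each <p> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []  # finished paragraph texts
        self._buffer = None   # None while we are outside a <p>

    def handle_starttag(self, tag, attrs):
        # Attributes are parsed for us, so class='p1' needs no special care.
        if tag == "p":
            self._buffer = []

    def handle_endtag(self, tag):
        if tag == "p" and self._buffer is not None:
            self.paragraphs.append("".join(self._buffer))
            self._buffer = None

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)


html = ("<p class='p1'>Para 1</p><p>Para 2</p>"
        '<p class="p2">Para 3</p><p>Para 4</p>')

parser = ParagraphText()
parser.feed(html)
parser.close()
print(parser.paragraphs)  # ['Para 1', 'Para 2', 'Para 3', 'Para 4']
```

The parser handles quoting and attributes itself, which is the part the regex versions have to approximate.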

That said, if you insist on a regex, try 也就是说,如果您坚持使用正则表达式,请尝试

(?<=<p[^<>]*>)(?:(?!</p>).)*

[Another screenshot]

Explanation:

(?<=<p[^<>]*>)  # Assert position right after a p tag
(?:(?!</p>).)*  # Match any number of characters until the next </p>
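
For readers whose regex engine lacks variable-width lookbehind, the same idea can be sketched with a capture group. The example below assumes Python's re module, which is such an engine, so the opening tag is matched normally and only the text after it is captured. This is an adaptation, not the .NET pattern above.

```python
import re

html = ("<p class='p1'>Para 1</p><p>Para 2</p>"
        '<p class="p2">Para 3</p><p>Para 4</p>')

# <p[^<>]*>         opening p tag, with or without attributes
# ((?:(?!</p>).)*)  capture one character at a time, as long as
#                   </p> does not start at the current position
pattern = re.compile(r"<p[^<>]*>((?:(?!</p>).)*)")

texts = pattern.findall(html)  # with one group, findall returns that group
print(texts)  # ['Para 1', 'Para 2', 'Para 3', 'Para 4']
```

Note that <p[^<>]*> also accepts tags whose names merely start with p, such as <pre>; writing <p\b[^<>]*> is stricter.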

Have you tried using the following expression?

<p[\s\S]*?>(?<text_inside_p>[\s\S]*?)</p>

The group named text_inside_p will contain the desired text.
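
A small sketch of reading that group, assuming Python's re module rather than .NET. Python spells named groups (?P<name>...) instead of (?<name>...); the pattern is otherwise unchanged.

```python
import re

html = ("<p class='p1'>Para 1</p><p>Para 2</p>"
        '<p class="p2">Para 3</p><p>Para 4</p>')

# Same pattern as above, with the named group in Python's (?P<name>...) syntax.
pattern = re.compile(r"<p[\s\S]*?>(?P<text_inside_p>[\s\S]*?)</p>")

# finditer yields one match object per <p>...</p> pair.
texts = [m.group("text_inside_p") for m in pattern.finditer(html)]
print(texts)  # ['Para 1', 'Para 2', 'Para 3', 'Para 4']
```

Because the tags are consumed by the match itself, no lookbehind is needed here.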
