
Select the first paragraph tag not contained within another tag using RegEx (Perl-style)

I have this block of HTML:

<div>
  <p>First, nested paragraph</p>
</div>
<p>First, non-nested paragraph.</p>
<p>Second paragraph.</p>
<p>Last paragraph.</p>

I'm trying to select the first non-nested paragraph in that block. I'm using PHP's (Perl-style) preg_match to find it, but I can't figure out how to ignore the p tag contained within the div.

This is what I have so far, but it selects the contents of the first paragraph, the one nested inside the div.

/<p>(.+?)<\/p>/is

Thanks!

EDIT

Unfortunately, I don't have the luxury of a DOM parser.

I completely appreciate the suggestions not to use RegEx to parse HTML, but that doesn't really help my particular use case. I have a very controlled case where an internal application generates structured text. I'm trying to replace some text if it matches a certain pattern. This is a simplified case where I'm trying to ignore text nested within other text, and HTML was the simplest example I could think of to explain it. My actual case looks a little more like this (but with a lot more data, and minified):

#[BILLINGCODE|12345|11|15|2001|15|26|50]#
[ITEM1|{{Escaped Description}}|1|1|4031|NONE|15]
#[{{Additional Details }}]#
[ITEM2|{{Escaped Description}}|3|1|7331|NONE|15]
[ITEM3|{{Escaped Description}}|1|1|9431|NONE|15]
[ITEM4|{{Escaped Description}}|1|1|5131|NONE|15]

I have to reformat a certain column of certain rows across a ton of rows similar to that. Help with my first question would help with the actual project.

Your regex won't work, because it has no notion of nesting: it simply captures the contents of the first <p> it finds. (With a greedy .+ in place of .+?, and only non-nested paragraphs, your capturing parentheses would even match everything from First, non-nested ... to Last paragraph.)

Try:

<([^>]+)>((?:[^<]|<(?!/?\1))*)</\1>

and grab \2 if \1 is p.
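As a rough sketch of how this approach might look in PHP, run against the block from the question: the pattern below writes the back-reference as `\1`, closes with `</\1>`, and wraps the skipped-over content in a single capture group. The variable names and the `~` delimiter are my own choices, and the sketch assumes tags without attributes and no element nested inside another element of the same name.

```php
<?php
$html = <<<'HTML'
<div>
  <p>First, nested paragraph</p>
</div>
<p>First, non-nested paragraph.</p>
<p>Second paragraph.</p>
<p>Last paragraph.</p>
HTML;

// \1 = tag name, \2 = everything up to the first closing tag with the same name.
// The look-ahead (?!/?\1) stops the content at any tag that opens or closes \1.
$pattern = '~<([^>]+)>((?:[^<]|<(?!/?\1))*)</\1>~';

$firstParagraph = null;
if (preg_match_all($pattern, $html, $matches, PREG_SET_ORDER)) {
    foreach ($matches as $match) {
        // Each match is one top-level element; the div swallows its nested <p>.
        if ($match[1] === 'p') {
            $firstParagraph = $match[2];
            break;
        }
    }
}

echo $firstParagraph, PHP_EOL; // prints: First, non-nested paragraph.
```

Because the whole div is consumed as a single match, the paragraph inside it is never offered as a match of its own.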

But an HTML parser would do a better job of that, IMHO.

How about something like this?

<p>([^<>]+)<\/p>(?=(<[^\/]|$))

It does a look-ahead to make sure the paragraph is not followed by a closing tag, although it can be at the end of the string. There is probably a better way to match what is inside the paragraph tags, but you need to avoid being too greedy (a .+? will not suffice).

Use a three-step process. First, pray that everything is well formed. Second, remove everything that is nested.
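A minimal PHP sketch of this look-ahead idea follows. The pattern above expects the next tag to come directly after `</p>`, which suits minified input; the sketch adds `\s*` inside the look-ahead (my assumption, not part of the pattern above) so that it also copes with the line breaks in the question's sample.

```php
<?php
$html = <<<'HTML'
<div>
  <p>First, nested paragraph</p>
</div>
<p>First, non-nested paragraph.</p>
<p>Second paragraph.</p>
<p>Last paragraph.</p>
HTML;

// Accept a paragraph only if, after optional whitespace, an opening tag
// (or the end of the string) follows, i.e. not a closing tag such as </div>.
$pattern = '~<p>([^<>]+)</p>(?=\s*(?:<[^/]|$))~';

$firstParagraph = preg_match($pattern, $html, $m) ? $m[1] : null;

echo $firstParagraph, PHP_EOL; // prints: First, non-nested paragraph.
```

This is a heuristic: a nested paragraph that is followed by another opening tag inside the same div would still be accepted.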

s{<div>.*?</div>}{}gs;        # HTML example (/s lets . match newlines)
s/#.*?#//g;                   # 2nd example

Then get your result. Everything that is left is not nested.

($result) = m{<p>(.*?)</p>};  # HTML example (list context returns the capture)
($result) = m{\[(.*?)\]};     # 2nd example

(This is Perl. I don't know how different it would look in PHP.)
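For comparison, here is roughly how the same remove-then-match idea might be written in PHP with preg_replace and preg_match. The variable names are hypothetical, the second input is a minified excerpt of the sample rows from the question, and the sketch assumes that the nested blocks (div or #...#) never contain further blocks of the same kind.

```php
<?php
$html = <<<'HTML'
<div>
  <p>First, nested paragraph</p>
</div>
<p>First, non-nested paragraph.</p>
<p>Second paragraph.</p>
HTML;

// Remove every <div>...</div> block; the s modifier lets . span line breaks.
$flatHtml = preg_replace('~<div>.*?</div>~s', '', $html);
// Whatever <p> comes first now was not nested.
$firstParagraph = preg_match('~<p>(.*?)</p>~s', $flatHtml, $m) ? $m[1] : null;
echo $firstParagraph, PHP_EOL; // prints: First, non-nested paragraph.

$rows = '#[BILLINGCODE|12345|11|15|2001|15|26|50]#'
      . '[ITEM1|{{Escaped Description}}|1|1|4031|NONE|15]'
      . '#[{{Additional Details }}]#'
      . '[ITEM2|{{Escaped Description}}|3|1|7331|NONE|15]';

// Same two steps for the structured text: drop the #...# blocks, then match.
$flatRows = preg_replace('/#.*?#/s', '', $rows);
$firstItem = preg_match('/\[(.*?)\]/', $flatRows, $m) ? $m[1] : null;
echo $firstItem, PHP_EOL; // prints: ITEM1|{{Escaped Description}}|1|1|4031|NONE|15
```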

"You shouldn't use regex to parse HTML."

That is what everybody says, but nobody really offers an example of how to actually do it; they just preach it. Well, thanks to some motivation from Levi Morrison, I decided to read up on DomDocument and figure out how to do it.

To everybody who says "Oh, it is too hard to learn the parser, I'll just use regex": I had never done anything with DomDocument or XPath before, and this took me 10 minutes. Go read the docs on DomDocument and parse HTML the way you're supposed to.

$myHtml = <<<MARKUP
   <html>
       <head>
            <title>something</title></head>
       <body>
            <div>
                <p>not valid</p>
            </div>
            <p>is valid</p>
            <p>is not valid</p>
            <p>is not valid either</p>
            <div>
                <p>definitely not valid</p>
            </div>
       </body>
   </html>
MARKUP;

$DomDocument = new DOMDocument();
$DomDocument->loadHTML($myHtml);
$DomXPath = new DOMXPath($DomDocument);
$nodeList = $DomXPath->query('body/p');
$yourNode = $DomDocument->saveHtml($nodeList->item(0));

var_dump($yourNode);

// output: string(15) "<p>is valid</p>"

You might want to have a look at this post about parsing HTML with Regex.

Because HTML is not a regular language (and regular expressions only describe regular languages), you can't parse out arbitrary chunks of HTML using Regex. Use an HTML parser; it will get the job done considerably more smoothly than trying to hack together some regex.
