How to match multiple paragraphs using regex?

Question

I would like to process the HTML from a webpage and extract the paragraphs that match my criteria. The flavor of regex is PHP.

This is the sample webpage HTML:

<div class="special">
    <p>Some interesting text I would like to extract</p>
    <p>More interesting text I would like to extract</p>
    <p>Even more interesting text I would like to extract</p>
</div>

The regex looks between the <div class="special"> and </div> tags and puts everything into a capture group or variable for reference in the next step. This next step is what I am having trouble with. I cannot for the life of me write a regex that captures each paragraph of text between <p> and </p>.

I have tried /<p>(.+?)<\\/p>/s which returns:

<p>Some interesting text I would like to extract</p>
<p>More interesting text I would like to extract</p>
<p>Even more interesting text I would like to extract</p>

I would like each paragraph to be returned individually as an item in an array. The non-greedy ? does not seem to work. Any suggestions?

Answer 1

You have to escape your slash for the p tag.

So it's going to be

/<p>(.+?)<\/p>/s
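To get each paragraph as a separate array item, as the question asks, the pattern above can be passed to preg_match_all. The following is a minimal sketch, assuming the input is the well-formed sample HTML from the question; the variable names $html, $matches and $paragraphs are illustrative and not from the original post.

```php
<?php
// Sample input copied from the question.
$html = <<<HTML
<div class="special">
    <p>Some interesting text I would like to extract</p>
    <p>More interesting text I would like to extract</p>
    <p>Even more interesting text I would like to extract</p>
</div>
HTML;

// preg_match_all collects every match of the pattern.
// $matches[0] holds the full matches, $matches[1] the first capture group.
preg_match_all('/<p>(.+?)<\/p>/s', $html, $matches);

// One array item per paragraph, without the surrounding <p> tags.
$paragraphs = $matches[1];
print_r($paragraphs);
```

With this input, $paragraphs should contain three strings, one per paragraph. Choosing a different delimiter, for example '#<p>(.+?)</p>#s', would avoid the need to escape the slash at all.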

Answer 2

So stupid! The regex works perfectly. All the regexes work perfectly. The problem was with the input. The HTML file I was processing had the following structure, which made the regex fail.

<p>Some interesting text I would like to extract
<p>More interesting text I would like to extract
<p>Even more interesting text I would like to extract</p></p></p>
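Why this structure defeats the pattern can be checked directly. The sketch below assumes the input is exactly the three lines shown above. The second pattern is a hypothetical workaround that is not part of the original answer: it captures the text after each <p> up to the next tag, so it only suits paragraphs without inline markup.

```php
<?php
// Malformed input as described in this answer: the closing tags are bunched at the end.
$html = <<<HTML
<p>Some interesting text I would like to extract
<p>More interesting text I would like to extract
<p>Even more interesting text I would like to extract</p></p></p>
HTML;

// The lazy pattern must run from the first <p> to the first </p>,
// so all three paragraphs end up inside a single match.
$count = preg_match_all('/<p>(.+?)<\/p>/s', $html, $lazy);

// Assumed workaround: take everything after each <p> up to the next "<".
preg_match_all('/<p>([^<]+)/', $html, $m);
$texts = array_map('trim', $m[1]); // strip the trailing newlines
print_r($texts);
```

Here $count should be 1, which matches the behavior reported in the question, while $texts should hold the three paragraph texts separately.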

I used var_dump(htmlfile.html) to see the HTML page I was getting, but my browser processed it, so I was not seeing the raw data. I was able to get the raw data and find my mistake by using:

include 'filename.php'; 
file_put_contents('filename.php', $data);

Now I know never to trust my browser to show raw data again!
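Another way to inspect raw markup in a browser, offered here as a sketch and not as what the author did, is to escape it before printing so the browser shows the tags instead of parsing them. The contents of $data below are a hypothetical stand-in for the fetched page.

```php
<?php
// Hypothetical raw markup standing in for the fetched page.
$data = "<p>Some text\n<p>More text</p></p>";

// Convert <, >, & and quotes to HTML entities so the tags are displayed literally.
$escaped = htmlspecialchars($data, ENT_QUOTES, 'UTF-8');

// <pre> preserves the original line breaks when viewed in a browser.
echo '<pre>' . $escaped . '</pre>';
```

Viewing the page source in the browser, or running the script from the command line, would also show the unprocessed output.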


Answer 1: score 1, 2016-04-16 02:22:47

Answer 2: score 0, accepted, 2016-04-16 15:38:05
