How to use regex to match multiple paragraphs?

I would like to process the HTML from a webpage and extract the paragraphs that match my criteria. The flavor of regex is PHP.

This is the sample webpage HTML:

<div class="special">
    <p>Some interesting text I would like to extract</p>
    <p>More interesting text I would like to extract</p>
    <p>Even more interesting text I would like to extract</p>
</div>

The regex looks between the <div class="special"> and </div> tags and puts everything into a capture group or variable for reference in the next step. This next step is what I am having trouble with. I cannot for the life of me write a regex that captures each paragraph of text between <p> and </p>.

I have tried /<p>(.+?)<\\/p>/s which returns:

<p>Some interesting text I would like to extract</p>
<p>More interesting text I would like to extract</p>
<p>Even more interesting text I would like to extract</p>

I would like each paragraph to be returned individually as an item in an array. The non-greedy ? does not seem to work. Any suggestions?

You have to escape the slash in the closing p tag with a single backslash.

So it's going to be:

/<p>(.+?)<\/p>/s
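
A minimal sketch of how this pattern could be used with preg_match_all so that each paragraph comes back as its own array item. The input is the sample HTML from the question; the variable names $html, $matches and $paragraphs are illustrative assumptions.

```php
<?php
// Sample input copied from the question (well-formed: one </p> per <p>).
$html = <<<'EOT'
<div class="special">
    <p>Some interesting text I would like to extract</p>
    <p>More interesting text I would like to extract</p>
    <p>Even more interesting text I would like to extract</p>
</div>
EOT;

// Single backslash before the slash; the s modifier lets . match newlines.
preg_match_all('/<p>(.+?)<\/p>/s', $html, $matches);

// $matches[0] holds the full matches, $matches[1] the captured text only.
$paragraphs = $matches[1];
print_r($paragraphs);
```

Note that inside a single-quoted PHP string, '\\/' and '\/' both produce the same pattern text \/, which would explain why the original attempt and this one behave identically on well-formed input.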

So stupid! The regex works perfectly. All the regexes work perfectly. The problem was with the inputs. The input HTML file I was processing had the following structure, which made the regex not work.

<p>Some interesting text I would like to extract
<p>More interesting text I would like to extract
<p>Even more interesting text I would like to extract</p></p></p>
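
To see why this structure defeats the pattern, here is a small sketch that runs the same regex against the malformed input. The fallback pattern at the end is an illustrative assumption, not part of the original post, and it only works when the paragraph text contains no other tags.

```php
<?php
// Malformed input from above: the closing tags are all bunched at the end.
$html = <<<'EOT'
<p>Some interesting text I would like to extract
<p>More interesting text I would like to extract
<p>Even more interesting text I would like to extract</p></p></p>
EOT;

// The lazy quantifier still starts at the first <p> and must run to the
// first </p>, so all three paragraphs end up inside a single match.
preg_match_all('/<p>(.+?)<\/p>/s', $html, $matches);
echo count($matches[1]), "\n";

// Hypothetical workaround: capture text up to the next tag instead of
// relying on </p>. Assumes paragraphs contain no nested markup.
preg_match_all('/<p>([^<]+)/', $html, $loose);
$paragraphs = array_map('trim', $loose[1]);
print_r($paragraphs);
```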

I used var_dump(htmlfile.html) to see the HTML page I was getting, but my browser processed it, so I was not seeing the raw data. I was able to get the raw data and find my mistake by using:

include 'filename.php'; 
file_put_contents('filename.php', $data);
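
Another way to inspect the raw markup without the browser interpreting it is to escape it before printing. This is only a sketch: $data stands in for whatever string holds the fetched HTML, and the sample value here is made up for illustration.

```php
<?php
// Stand-in for the fetched page; in practice this would come from
// file_get_contents() or a similar call.
$data = "<p>First\n<p>Second</p></p>";

// htmlspecialchars() turns < and > into entities, so the browser shows
// the tags as text instead of rendering (and silently repairing) them.
$escaped = htmlspecialchars($data);
echo '<pre>', $escaped, '</pre>';
```

Viewing the page source in the browser, rather than the rendered page, should show the same unprocessed markup.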

Now I know not to trust my browser to return raw data ever again!
