How to use regex to match multiple paragraphs?
I would like to process the HTML from a webpage and extract the paragraphs that match my criteria. The flavor of regex is PHP.
This is the sample webpage HTML:
<div class="special">
<p>Some interesting text I would like to extract</p>
<p>More interesting text I would like to extract</p>
<p>Even more interesting text I would like to extract</p>
</div>
The regex looks between the <div class="special"> and </div> tags and puts everything into a capture group or variable for reference in the next step. This next step is what I am having trouble with. I cannot for the life of me write a regex that captures each paragraph of text between <p> and </p>.
I have tried /<p>(.+?)<\\/p>/s which returns:
<p>Some interesting text I would like to extract</p>
<p>More interesting text I would like to extract</p>
<p>Even more interesting text I would like to extract</p>
I would like each paragraph to be returned individually as an item in an array. The non-greedy ? does not seem to work. Any suggestions?
You have to escape the slash in the closing p tag. So it's going to be
/<p>(.+?)<\/p>/s
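As a minimal sketch of how this pattern might be applied, the snippet below runs it through preg_match_all against the sample HTML from the question, so that each paragraph ends up as its own array item. The variable names $html and $paragraphs are assumptions made for illustration and do not come from the original post.

```php
<?php
// Sample input copied from the question; $html is a hypothetical variable name.
$html = <<<EOT
<div class="special">
<p>Some interesting text I would like to extract</p>
<p>More interesting text I would like to extract</p>
<p>Even more interesting text I would like to extract</p>
</div>
EOT;

// preg_match_all collects every match; index 1 holds the first capture group,
// i.e. the text between each <p> and </p> pair.
preg_match_all('/<p>(.+?)<\/p>/s', $html, $matches);
$paragraphs = $matches[1];

print_r($paragraphs);
```

With well-formed input like this, $paragraphs should contain three separate strings, one per paragraph.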
So stupid! The regex works perfectly. All the regexes work perfectly. The problem was with the input. The input HTML file I was processing had the following structure, which made the regex not work.
<p>Some interesting text I would like to extract
<p>More interesting text I would like to extract
<p>Even more interesting text I would like to extract</p></p></p>
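To illustrate why this structure defeats the pattern, here is a small sketch that runs the same regex against the malformed markup above. Because the first </p> only appears at the very end, the lazy .+? has to stretch from the first <p> all the way to it, so everything lands in a single match. The variable name $broken is an assumption for illustration.

```php
<?php
// Malformed input as described above: nested <p> tags closed only at the end.
$broken = "<p>Some interesting text I would like to extract\n"
        . "<p>More interesting text I would like to extract\n"
        . "<p>Even more interesting text I would like to extract</p></p></p>";

// Same pattern as before; preg_match_all returns the number of full matches.
$count = preg_match_all('/<p>(.+?)<\/p>/s', $broken, $matches);

echo $count, "\n";          // number of matches found
var_dump($matches[1][0]);   // the single capture still contains the inner <p> tags
```

Here the match count is 1 rather than 3, which is consistent with all paragraphs being returned together.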
I used var_dump(htmlfile.html) to see the HTML page I was getting, but my browser rendered it, so I was not seeing the raw data. I was able to get the raw data and find my mistake by using:
include 'filename.php';
file_put_contents('filename.php', $data);
Now I know to not trust my browser to return raw data ever again!
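Another way to inspect the raw markup, sketched here as an alternative and not as what the poster did, is to escape the HTML before printing it so the browser shows the tags literally. The contents of $data below are made up for illustration; htmlspecialchars is a standard PHP function.

```php
<?php
// $data stands in for whatever HTML string was fetched; this value is invented.
$data = "<p>Some text\n<p>More text</p></p>";

// Escaping < and > makes the browser display the tags instead of rendering them.
$escaped = htmlspecialchars($data);

echo '<pre>', $escaped, '</pre>';
```

Viewing the page source in the browser, or running the script from the command line, would presumably serve the same purpose.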