
How to use regex to match multiple paragraphs?

I would like to process the HTML from a webpage and extract the paragraphs that match my criteria. The regex flavor is PHP (PCRE).

This is the sample webpage HTML:

<div class="special">
    <p>Some interesting text I would like to extract</p>
    <p>More interesting text I would like to extract</p>
    <p>Even more interesting text I would like to extract</p>
</div>

The first step uses a regex that looks between the <div class="special"> and </div> tags and puts everything into a capture group or variable for reference in the next step. That next step is what I am having trouble with: I cannot for the life of me write a regex that captures each paragraph of text between <p> and </p>.

I have tried /<p>(.+?)<\\/p>/s which returns:

<p>Some interesting text I would like to extract</p>
<p>More interesting text I would like to extract</p>
<p>Even more interesting text I would like to extract</p>

I would like each paragraph to be returned individually as an item in an array. The non-greedy ? does not seem to work. Any suggestions?

You have to escape the forward slash in the closing </p> tag, because / is also used as the pattern delimiter.

So it's going to be

/<p>(.+?)<\/p>/s
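As a minimal sketch of how this pattern can return each paragraph as its own array item, assuming the well-formed sample HTML from the question and using preg_match_all (the variable names $html, $matches and $paragraphs are illustrative, not from the original post):

```php
<?php
// Sample input taken from the question (well-formed paragraphs).
$html = <<<HTML
<div class="special">
    <p>Some interesting text I would like to extract</p>
    <p>More interesting text I would like to extract</p>
    <p>Even more interesting text I would like to extract</p>
</div>
HTML;

// preg_match_all collects every match; $matches[1] holds the first capture group
// of each match, i.e. the text between <p> and </p>.
preg_match_all('/<p>(.+?)<\/p>/s', $html, $matches);

$paragraphs = $matches[1];
print_r($paragraphs);
```

Choosing a different delimiter, for example '#<p>(.+?)</p>#s', avoids the need to escape the slash at all.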

So stupid! The regex works perfectly. All the regexes work perfectly. The problem was with the input. The HTML file I was processing had the following structure, which is why the regex did not return separate paragraphs:

<p>Some interesting text I would like to extract
<p>More interesting text I would like to extract
<p>Even more interesting text I would like to extract</p></p></p>
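To see why this input defeats the pattern, here is a small sketch run against the malformed structure above. The tolerant pattern at the end is only an assumption about one way to cope with unclosed tags; it is not from the original post and would not handle paragraphs that contain nested tags.

```php
<?php
// Malformed input as described above: the closing tags are all bunched at the end.
$html = "<p>Some interesting text I would like to extract\n"
      . "<p>More interesting text I would like to extract\n"
      . "<p>Even more interesting text I would like to extract</p></p></p>";

// The lazy quantifier still has to run until the first </p>, so everything
// from the first <p> onward ends up in a single match.
$count = preg_match_all('/<p>(.+?)<\/p>/s', $html, $matches);

// Hypothetical workaround: capture plain text up to the next tag of any kind.
preg_match_all('/<p>([^<]+)/', $html, $tolerant);
$paragraphs = array_map('trim', $tolerant[1]);

var_dump($count);
print_r($paragraphs);
```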

I used var_dump(htmlfile.html) to look at the HTML page I was getting, but my browser rendered the output, so I was not seeing the raw data. I was able to get the raw data and find my mistake by using:

include 'filename.php'; 
file_put_contents('filename.php', $data);

Now I know to not trust my browser to return raw data ever again!
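For anyone hitting the same issue, one way to inspect the raw markup without the browser interpreting it is to escape it before output, or to write it to a file and open that in a text editor. A minimal sketch, assuming the fetched HTML is held in a variable named $data as in the snippet above; the stand-in string and the file name raw_dump.txt are hypothetical:

```php
<?php
// Stand-in for the fetched page; in practice $data would come from the HTTP request.
$data = "<p>Some text\n<p>More text</p></p>";

// Option 1: escape the markup so a browser shows the tags literally.
$escaped = htmlspecialchars($data);
echo '<pre>' . $escaped . '</pre>';

// Option 2: write the raw bytes to a file (hypothetical name) and open it in a text editor.
$path = sys_get_temp_dir() . '/raw_dump.txt';
file_put_contents($path, $data);
```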
