How can I use regex or DOM with PHP to get a slice of HTML?

If I have a block of HTML and want to get the exact HTML for certain nodes and their child nodes, for example the <ul> block below, should I use something like preg_match, or should I parse the content with something like a DOM parser?

Input

<html>
<head>
</head>
<body>
<h2>List</h2>
<ul class="my-list" id="my-list">
    <li class="item first">item1</li>
    <li class="item second">item2</li>
    <li class="item third">item3</li>
</ul>
</body>
</html>

Desired output

<ul class="my-list" id="my-list">
    <li class="item first">item1</li>
    <li class="item second">item2</li>
    <li class="item third">item3</li>
</ul>

As you can see I want to preserve all the attributes (classes, ids, etc).

I know that with DOM parsing I can access all of those attributes ( $items->item($i)->getAttribute('class') ), but can DOM easily (and automatically) rebuild just a section of the original markup, without my having to loop through the nodes and build the HTML manually? (I know DOM has echo $DOM->saveXML() , but I believe that is just for the entire page.)

I know how I can accomplish this with regex and PHP fairly easily, but I suspect that is not good practice.

This is so simple with jQuery:

jQuery('ul').clone()

How can I achieve the same thing with PHP? (Grabbing remote HTML, getting a slice of it using DOM, and outputting it as HTML again.)

It's not that bad with the DOM functions, though it is maybe a bit more verbose than it should be:

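$dom = new DOMDocument();
// @ suppresses the warnings libxml raises on malformed real-world HTML
@$dom->loadHTML($html);
// or, to load straight from a URL or file:
// @$dom->loadHTMLFile($url);
$xpath = new DOMXPath($dom);
// Passing a node to saveXML() serializes only that node and its children
echo $dom->saveXML($xpath->query("//ul")->item(0));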
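The same idea can be wrapped in a small reusable function. This is only a sketch: the helper name outerHtml and the XPath query by id are assumptions for illustration, not part of the answer above. It uses saveHTML() with a node argument (available since PHP 5.3.6) instead of saveXML(), so the slice is serialized as HTML, and it uses libxml_use_internal_errors() rather than @ to silence parse warnings.

```php
<?php
// Hypothetical helper: returns the outer HTML of the first node
// matching the XPath $query, or null when nothing matches.
function outerHtml(string $html, string $query): ?string
{
    // Collect libxml parse warnings internally instead of emitting them.
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    $nodes = $xpath->query($query);
    if ($nodes === false || $nodes->length === 0) {
        return null;
    }

    // Passing a node serializes only that subtree, attributes included.
    return $dom->saveHTML($nodes->item(0));
}

$html = <<<EOT
<html><body>
<h2>List</h2>
<ul class="my-list" id="my-list">
    <li class="item first">item1</li>
    <li class="item second">item2</li>
</ul>
</body></html>
EOT;

// Select the list by its id so that other <ul> elements are ignored.
echo outerHtml($html, '//ul[@id="my-list"]');
```

Selecting by id (or class) in the XPath query, rather than taking the first //ul, keeps the code working if another list is later added higher up the page.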

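I would recommend DOM parsing, because it is easier to maintain if the HTML structure changes, and the code is easier to read and understand than a regular expression.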

It depends on how much you trust the data source. Is it going to be consistent? Could there be errors in the markup? Do you know what to expect?

If it's as simple as your sample, or relatively close to it, I see no reason regex isn't a perfectly valid choice here.

It gets more difficult if, for example, there are multiple <ul> elements. As long as something uniquely identifies the one you want, or it always appears in the same position, it shouldn't be a problem though.
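As a rough illustration of the regex route, the sketch below picks one list out of several by its id. The function name sliceListById is hypothetical, and the pattern relies on strong assumptions: the id attribute is double-quoted, and the target <ul> contains no nested <ul> (the lazy match stops at the first closing tag). If either assumption fails, the DOM approach is the safer choice.

```php
<?php
// Hypothetical helper: returns the <ul> whose id equals $id, or null.
// Assumes a double-quoted id attribute and no nested <ul> inside the list.
function sliceListById(string $html, string $id): ?string
{
    $pattern = '~<ul\b[^>]*\sid="' . preg_quote($id, '~') . '"[^>]*>.*?</ul>~is';
    if (preg_match($pattern, $html, $matches) === 1) {
        return $matches[0];
    }
    return null;
}

$html = <<<EOT
<ul id="other"><li>skip me</li></ul>
<ul class="my-list" id="my-list">
    <li class="item first">item1</li>
    <li class="item second">item2</li>
</ul>
EOT;

// Only the list carrying id="my-list" is returned, attributes intact.
echo sliceListById($html, 'my-list');
```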
