How can I use regex or DOM with PHP to grab a slice of HTML?

Question

If I have a block of HTML and want to get the exact HTML content for certain nodes and their child nodes, for example the <ul> block below, should I use something like preg_match, or parse the content with something like DOM parsing?

Input

<html>
<head>
</head>
<body>
<h2>List</h2>
<ul class="my-list" id="my-list">
    <li class="item first">item1</li>
    <li class="item second">item2</li>
    <li class="item third">item3</li>
</ul>
</body>
</html>

Desired output

<ul class="my-list" id="my-list">
    <li class="item first">item1</li>
    <li class="item second">item2</li>
    <li class="item third">item3</li>
</ul>

As you can see, I want to preserve all the attributes (classes, ids, etc.).

I know that with DOM parsing I can access all of those attributes ($items->item($i)->getAttribute('class')), but can DOM easily (and automatically) rebuild just a section of the original code, without my having to loop through manually and build the HTML? (I know DOM has echo $DOM->saveXML(), but I believe that is just for the entire page.)

I know how to accomplish this with regex and PHP fairly easily, but I suspect that is not good practice.

This is so simple with jQuery:

jQuery('ul').clone()

How can I achieve the same thing with PHP? (That is, grab remote HTML, take a slice of it using DOM, and output that slice as HTML again.)

Answer 1

It's not that bad with the DOM functions, though maybe a bit more verbose than it should be:

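$dom = new DOMDocument();
# @ suppresses the warnings libxml emits for malformed HTML
@$dom->loadHTML($html);
# or, to load from a file or URL:
# @$dom->loadHTMLFile($url);
$xpath = new DOMXPath($dom);
# passing a node to saveXML() serializes only that node and its children
echo $dom->saveXML($xpath->query("//ul")->item(0));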
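
The same approach can be wrapped in a small function. This is only a sketch and not part of the original answer: the function name extractFirst is hypothetical, it uses libxml_use_internal_errors() in place of the @ operator, and it serializes with saveHTML($node), which accepts a node argument as of PHP 5.3.6.

```php
<?php
// Hypothetical helper: returns the outer HTML of the first node matching
// an XPath expression, or null when nothing matches.
function extractFirst(string $html, string $query): ?string
{
    // Collect libxml parse warnings internally instead of emitting them.
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    $nodes = $xpath->query($query);
    if ($nodes === false || $nodes->length === 0) {
        return null;
    }

    // With a node argument, saveHTML() serializes only that subtree,
    // attributes included.
    return $dom->saveHTML($nodes->item(0));
}

// Sample markup taken from the question.
$html = <<<HTML
<html>
<head>
</head>
<body>
<h2>List</h2>
<ul class="my-list" id="my-list">
    <li class="item first">item1</li>
    <li class="item second">item2</li>
    <li class="item third">item3</li>
</ul>
</body>
</html>
HTML;

echo extractFirst($html, '//ul');
```

Returning null for "no match" lets the caller distinguish a missing element from an empty one.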

Answer 2

I would recommend DOM parsing: it will be easier to maintain if the HTML structure changes, and the code is easier to read and understand than a regex.
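
To illustrate the maintainability point, here is a sketch that selects the list by its id rather than by position, so it keeps working when other elements are added around it. The markup is adapted from the question (with an extra list and wrapper added as an assumption), and the variable names are illustrative.

```php
<?php
// Two lists on the page; the one we want is nested inside a <div>.
$html = '<body><ul id="other"><li>x</li></ul>'
      . '<div><ul class="my-list" id="my-list">'
      . '<li class="item first">item1</li></ul></div></body>';

libxml_use_internal_errors(true); // keep parse warnings out of the output
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

// Select by id, independent of where the list sits in the tree.
$xpath = new DOMXPath($dom);
$list = $xpath->query('//ul[@id="my-list"]')->item(0);

// item(0) is null when nothing matched.
$slice = $list !== null ? $dom->saveHTML($list) : '';
echo $slice;
```

If the list later moves into another container, the XPath expression does not need to change.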

Answer 3

It depends on how much you trust the data source. Is it going to be consistent? Could there be errors in the markup? Do you know what to expect?

If it's as simple as your sample, or relatively close to it, I see no reason regex isn't a perfectly valid choice here.

It gets more difficult if, for example, there are multiple <ul>s. As long as something uniquely identifies the one you want, or it is always in the same position, that shouldn't be a problem though.
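
For markup as regular as the sample, a regex sketch might look like the following. It assumes the target <ul> carries id="my-list" in double quotes and contains no nested <ul>; the non-greedy match stops at the first </ul>, so a nested list would be cut short.

```php
<?php
// Sample adapted from the question, with a second <ul> added to show
// that the id is what singles out the right list.
$html = <<<HTML
<h2>List</h2>
<ul class="my-list" id="my-list">
    <li class="item first">item1</li>
    <li class="item second">item2</li>
</ul>
<ul id="other"><li>x</li></ul>
HTML;

// [^>]* keeps the attribute match inside the opening tag; the s flag lets
// . span newlines, and .*? stops at the first closing </ul>.
$pattern = '~<ul\b[^>]*\bid="my-list"[^>]*>.*?</ul>~is';

// preg_match() returns 1 on a match; $m[0] holds the full matched text.
$slice = preg_match($pattern, $html, $m) === 1 ? $m[0] : null;
echo $slice;
```

This only holds up while those assumptions do, which is the trust question raised above.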

3 answers

Answer 1
2, accepted, 2012-04-29 09:00:48

Answer 2
1, 2012-04-28 03:28:20

Answer 3
0, 2012-04-28 03:25:20
