
What is the best way to extract data from an HTML file using the PHP DOM functions?

I need to extract large amounts of data from a variety of HTML files, and I will have to write a separate script for each type of HTML file in order to parse out the data I need correctly.

The data will be located in different parts of the document - for example, in document type one, the data I need may be nicely inside a DIV with an ID, but in document type two the only way to locate it may be by finding the particular pattern of tags that contains it (like <div><b>DATA</b></div>).

From the little I've been able to find so far, it seems that DOMXPath may be able to help me with at least some of the extraction - what other functions can I use, specifically for the second example of locating an arbitrary pattern of tags and getting their content?

If you are extracting different types of data from a variety of HTML files, you will quickly tire of using the DOMDocument API and XPath. Use one of the wrapper libraries listed in "How do you parse and process HTML/XML in PHP?". They provide a richer API and additional selectors.

I prefer phpQuery and QueryPath, which allow for:

print qp($url)->find("body p.article a")->attr("href");

print qp($html)->find("div b")->text();

The usable functions are documented here: http://api.querypath.org/docs/class_query_path.html - the API is mostly like jQuery.
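For comparison, the two cases from the question (data inside a DIV with an ID, and data found only by its surrounding tag pattern) can also be handled with the built-in DOMDocument and DOMXPath classes, without a wrapper library. The following is only a minimal sketch: the sample markup, the ID `price`, and the variable names are assumptions made up for illustration, not part of the original question.

```php
<?php
// Hypothetical sample markup; the ID "price" and all values are invented for this sketch.
$html = <<<HTML
<html><body>
  <div id="price">42.00</div>
  <div><b>DATA</b></div>
  <div><i>ignored</i></div>
</body></html>
HTML;

$doc = new DOMDocument();
// Real-world HTML is often malformed; collect parser warnings instead of printing them.
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// Case 1: the data sits in an element with a known ID.
$byId = $xpath->query('//div[@id="price"]')->item(0);
$price = $byId !== null ? trim($byId->textContent) : null;

// Case 2: no ID, so match the tag pattern instead:
// every <b> that is a direct child of a <div>.
$matches = [];
foreach ($xpath->query('//div/b') as $node) {
    $matches[] = trim($node->textContent);
}

print $price . "\n";
print implode(", ", $matches) . "\n";
```

The XPath expression in case 2 is the part that changes per document type, so one way to organize the per-type scripts is to keep a small list of expressions for each type and run them through the same loop.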

If you plan on parsing many HTML files and need to select or modify many elements in them, consider using a library.

I would recommend the library PHPPowertools/DOM-Query, which I wrote myself. It allows you to (1) load an HTML file and then (2) select or change parts of your HTML pretty much the same way you would if you were using jQuery in a frontend app.

Example use:

// Select the body tag
$body = $H->select('body');

// Combine different classes as one selector to get all site blocks
$siteblocks = $body->select('.site-header, .masthead, .site-body, .site-footer');

// Nest your methods just like you would with jQuery
$siteblocks->select('button')->add('span')->addClass('icon icon-printer');

// Use a lambda function to set the text of all site blocks
$siteblocks->text(function($i, $val) {
    return $i . " - " . $val->attr('class');
});

// Append the following HTML to all site blocks
$siteblocks->append('<div class="site-center"></div>');

// Use a child selector to select the site's footer
$sitefooter = $body->select('.site-footer > .site-center');

// Set some attributes for the site's footer
$sitefooter->attr(array('id' => 'aweeesome', 'data-val' => 'see'));

// Use a lambda function to set the attributes of all site blocks
$siteblocks->attr('data-val', function($i, $val) {
    return $i . " - " . $val->attr('class') . " - photo by Kelly Clark";
});

// Select the parent of the site's footer
$sitefooterparent = $sitefooter->parent();

// Remove the class of all i-tags within the site's footer's parent
$sitefooterparent->select('i')->removeAttr('class');

// Wrap the site's footer within two new elements
$sitefooter->wrap('<section><div class="footer-wrapper"></div></section>');

[...]
