
How to extract data from a raw HTML file?

Original post: 2009-11-30 17:13:41. Tags: php, html, parsing, html-content-extraction

Is there a way to extract desired data from raw HTML that was written unsemantically, with no IDs or classes? I mean, suppose there is a saved HTML file of a webpage (a profile) and I want to extract data such as, say, 'hobbies'. Is it possible to do this using PHP?

5 answers

BeautifulSoup (http://www.crummy.com/software/BeautifulSoup/), maybe?

Sounds like you're looking for a PHP DOM parser, such as this one. It'll probably be a bit tricky to pull out the data you need if the HTML is truly devoid of semantic structure, but a DOM parser is the place to start.
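As a rough illustration of pulling a field out of markup that has no IDs or classes, here is a minimal sketch in Python using only the standard library. Note that `html.parser` is an event-based parser standing in for a full DOM parser here, and that the sample markup, the `LabelValueParser` class, and the `extract_after_label` helper are all hypothetical. It assumes the value of interest is the first non-empty text that follows a visible label such as "Hobbies:".

```python
from html.parser import HTMLParser


class LabelValueParser(HTMLParser):
    """Remember the first non-empty text chunk that follows a given label."""

    def __init__(self, label):
        super().__init__()
        self.label = label.lower()
        self.seen_label = False  # becomes True once the label text is seen
        self.value = None        # text chunk found right after the label

    def handle_data(self, data):
        text = data.strip()
        if not text or self.value is not None:
            return  # skip whitespace, and stop once a value is captured
        if self.seen_label:
            self.value = text
        elif text.lower().rstrip(":") == self.label:
            self.seen_label = True


def extract_after_label(html, label):
    """Return the text following `label`, or None if the label is absent."""
    parser = LabelValueParser(label)
    parser.feed(html)
    parser.close()
    return parser.value


# Hypothetical profile markup: table layout, no ids or classes.
SAMPLE = """
<table>
  <tr><td><b>Name:</b></td><td>Alice</td></tr>
  <tr><td><b>Hobbies:</b></td><td>chess, hiking</td></tr>
</table>
"""

print(extract_after_label(SAMPLE, "Hobbies"))
```

Keying on the visible label rather than on tag positions tends to survive small layout changes, but it breaks if the label wording changes or the value is split across several elements.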

Yes, the technique is called web scraping. You could use the DOM if it's valid HTML. If the page is dynamically generated, the generator will have used some structure, and in my experience you can always isolate the elements of interest.

If the DOM does not work for you, you can just use regular expressions (that's what I always used to do when writing web spiders). Regular expressions are more effective and quicker than writing scraping logic against a DOM hierarchy. So you need to open a few of the profile pages and analyze the static structure. Then just write a regular expression to isolate the fields of interest.
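A minimal sketch of that regular-expression approach, in Python, might look like the following. The markup pattern (a bold "Hobbies:" label in one table cell, the value in the next) is an assumption made up for illustration, as are the names `HOBBIES_RE` and `extract_hobbies`; a real pattern would come from inspecting the actual profile pages.

```python
import re

# Assumed page structure: <td><b>Hobbies:</b></td><td ...>value</td>
HOBBIES_RE = re.compile(
    r"<b>\s*Hobbies:?\s*</b>\s*</td>\s*<td[^>]*>(.*?)</td>",
    re.IGNORECASE | re.DOTALL,
)


def extract_hobbies(html):
    """Return the list of hobbies, or None if the pattern does not match."""
    match = HOBBIES_RE.search(html)
    if match is None:
        return None
    # Drop any inline tags left inside the captured cell.
    inner = re.sub(r"<[^>]+>", "", match.group(1))
    return [item.strip() for item in inner.split(",") if item.strip()]


# Hypothetical fragment of a saved profile page.
SAMPLE = (
    '<tr><td><b>Hobbies:</b></td>\n'
    '<td align="left">chess, <i>hiking</i> </td></tr>'
)

print(extract_hobbies(SAMPLE))
```

A pattern like this is tied to one exact layout, so it is worth checking it against several saved pages before relying on it.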

Use regex! I kid, I kid. If you know the state of the page, and the format is guaranteed to remain similar enough, then you can try writing a manual parser. Alternatively, there are a lot of libraries out there that will parse HTML for you. I'm not familiar enough with PHP to recommend one, but I'm sure some Googling could take you a long way. I've had luck with John Resig's pure JavaScript HTML parser before.

At the end of the day, if you need semantic information from an HTML page that isn't constructed semantically, you're probably doomed programmatically and your best bet may be a mechanical turk.

There are two approaches to take with PHP. The first is to clean your document up using the tidy extension so that it's valid XHTML, and therefore well-formed XML, which can be parsed using XML tools.

The second is to use the PHP release of the html5lib parser, which attempts to implement the HTML5 research into current browser parsing routines. If it displays in a browser, html5lib can parse it.

Using either approach you'll end up with a DOM object you can query using XPath expressions. Since your theoretical documents lack semantic structure, you'll want to look at the document parts from a "the 5th span inside the 3rd p" mindset.
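The positional mindset can be sketched as follows, in Python, using the limited XPath subset in the standard library's `xml.etree.ElementTree`. This assumes the document has already been cleaned into well-formed XML; the sample document and the `nth_span_in_nth_p` helper are hypothetical. Real tidy output usually declares the XHTML namespace, in which case the tag names in the path would need namespace prefixes.

```python
import xml.etree.ElementTree as ET

# Hypothetical cleaned-up document: well-formed, no namespace, no ids.
XHTML = """<html><body>
<p>intro</p>
<p>contact</p>
<p><span>Name</span><span>Alice</span><span>Age</span><span>30</span><span>chess, hiking</span></p>
</body></html>"""


def nth_span_in_nth_p(xhtml, p_index, span_index):
    """Return the text of the Nth <span> inside the Mth <p> (1-based)."""
    root = ET.fromstring(xhtml)
    # Positional predicates count among sibling elements with the same tag.
    node = root.find(f".//p[{p_index}]/span[{span_index}]")
    if node is None:
        return None
    return "".join(node.itertext()).strip()


print(nth_span_in_nth_p(XHTML, 3, 5))
```

Positional paths are brittle: one extra paragraph on a page shifts every index, so they suit pages generated from a single fixed template.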

More information here (self-link warning).