How to scrape specific data using the Simple HTML DOM parser

Question

I am trying to scrape data from a webpage, and I need to get all the data at this link.

include 'simple_html_dom.php';
$html1 = file_get_html('http://www.aktive-buergerschaft.de/buergerstiftungen/unsere_leistungen/buergerstiftungsfinder');

$info1 = $html1->find('b[class=[what to enter here]]', 0);

I need to get all the data out of this site.

Bürgerstiftung Lebensraum Aachen
    rechtsfähige Stiftung des bürgerlichen Rechts
    Ansprechpartner: Hubert Schramm
    Alexanderstr. 69/ 71
    52062 Aachen
    Telefon: 0241 - 4500130
    Telefax: 0241 - 4500131
    Email: info@buergerstiftung-aachen.de
    www.buergerstiftung-aachen.de
    >> Weitere Details zu dieser Stiftung

Bürgerstiftung Achim
    rechtsfähige Stiftung des bürgerlichen Rechts
    Ansprechpartner: Helga Kühn
    Rotkehlchenstr. 72
    28832 Achim
    Telefon: 04202-84981
    Telefax: 04202-955210
    Email: info@buergerstiftung-achim.de
    www.buergerstiftung-achim.de
    >> Weitere Details zu dieser Stiftung

I also need the data that is "behind" each link. Is there any way to do this with a simple parser, one that a newbie can understand and write?

Answer 1

The links you provided are down. I suggest using PHP's native DOM extension instead of Simple HTML DOM; it will be much faster and easier ;) I looked at the page through the Google cache, and you can use something like:

$doc = new DOMDocument;
@$doc->loadHTMLFile('...URL....'); // Using the @ operator to hide parse errors
$contents = $doc->getElementById('content')->nodeValue; // Text contents of #content
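
To make the DOM approach above concrete without depending on the site being reachable, here is a minimal sketch that parses an inline HTML string instead of a URL. The sample markup is an assumption (the real page may be structured differently), and the `<?xml encoding="UTF-8">` prefix is a common workaround so that umlauts are read as UTF-8. It also swaps the `@` operator for `libxml_use_internal_errors()` and guards against a missing `#content` element.

```php
<?php
// Assumed sample markup; the real page structure may differ.
$html = <<<HTML
<html><body>
<div id="content">
  <dl>
    <dt>Bürgerstiftung Achim</dt>
    <dd>28832 Achim</dd>
  </dl>
</div>
</body></html>
HTML;

// Collect parse warnings internally instead of printing them.
libxml_use_internal_errors(true);

$doc = new DOMDocument();
// The XML declaration prefix hints that the input is UTF-8.
$doc->loadHTML('<?xml encoding="UTF-8">' . $html);
libxml_clear_errors();

// getElementById() returns null when no element has that id.
$node = $doc->getElementById('content');
$contents = $node !== null ? trim($node->textContent) : '';

echo $contents, PHP_EOL;
```

For a live page, `loadHTMLFile()` with the page URL would replace the `loadHTML()` call.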

Answer 2

It seems to be covered in the documentation:

$html1->find('b[class=info]',0)->innertext;

Answer 3

From a quick glance, you need to loop through the <dl> tags in #content, then through the dt and dd elements.

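// $html is the object returned by file_get_html()
foreach ($html->find('#content dl') as $item) {
    $info = $item->find('dd');   // all <dd> entries of this <dl>
    foreach ($info as $info_item) {
        // process each <dd> here, e.g. read $info_item->plaintext
    }
}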

This uses the simple_html_dom library.
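
The same dl/dt/dd loop can also be sketched with PHP's bundled DOM extension, so it runs without installing simple_html_dom. This swaps the library, not the idea. The markup below is an assumption built from the listing in the question (one `<dl>` per foundation, the name in `<dt>`, detail lines in `<dd>`); the real page may differ.

```php
<?php
// Assumed markup: one <dl> per foundation, <dt> = name, <dd> = detail lines.
$html = <<<HTML
<div id="content">
  <dl>
    <dt>Bürgerstiftung Lebensraum Aachen</dt>
    <dd>Alexanderstr. 69/ 71</dd>
    <dd>52062 Aachen</dd>
  </dl>
  <dl>
    <dt>Bürgerstiftung Achim</dt>
    <dd>Rotkehlchenstr. 72</dd>
    <dd>28832 Achim</dd>
  </dl>
</div>
HTML;

libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML('<?xml encoding="UTF-8">' . $html); // hint UTF-8 input
libxml_clear_errors();

$records = [];
$content = $doc->getElementById('content');
if ($content !== null) {
    // Outer loop: every <dl> inside #content.
    foreach ($content->getElementsByTagName('dl') as $dl) {
        $name = '';
        $details = [];
        // Inner loop: the <dt> and <dd> children of this <dl>.
        foreach ($dl->childNodes as $child) {
            if ($child->nodeName === 'dt') {
                $name = trim($child->textContent);
            } elseif ($child->nodeName === 'dd') {
                $details[] = trim($child->textContent);
            }
        }
        $records[] = ['name' => $name, 'details' => $details];
    }
}

print_r($records);
```

Each entry in `$records` then holds one foundation with its detail lines, which is easier to export than a single block of text.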

Answer 4

XPath makes scraping ridiculously easy, and it lets some changes in the HTML document go by without affecting you. For example, to pull out the names, you'd use a query that looks like:

//div[@id='content']/dl/dt

A simple Google search will give you plenty of tutorials.
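
As a rough illustration of the XPath route in PHP, the sketch below runs the query `//div[@id='content']/dl/dt` with `DOMXPath` (attribute tests need the `@` prefix, and the list element is `dl`). The markup and the `/details/...` link targets are hypothetical, made up for the example; a second query collects the `href` of each details link, which is where a scraper would start to fetch the data "behind" the link.

```php
<?php
// Assumed markup; the href values are hypothetical placeholders.
$html = <<<HTML
<div id="content">
  <dl>
    <dt>Bürgerstiftung Lebensraum Aachen</dt>
    <dd>52062 Aachen</dd>
    <dd><a href="/details/aachen">Weitere Details zu dieser Stiftung</a></dd>
  </dl>
  <dl>
    <dt>Bürgerstiftung Achim</dt>
    <dd>28832 Achim</dd>
    <dd><a href="/details/achim">Weitere Details zu dieser Stiftung</a></dd>
  </dl>
</div>
HTML;

libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML('<?xml encoding="UTF-8">' . $html); // hint UTF-8 input
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// Names: every <dt> directly inside a <dl> that sits in div#content.
$names = [];
foreach ($xpath->query("//div[@id='content']/dl/dt") as $dt) {
    $names[] = trim($dt->textContent);
}

// Detail links: the href attribute of each <a> inside a <dd>.
$links = [];
foreach ($xpath->query("//div[@id='content']/dl/dd/a/@href") as $href) {
    $links[] = $href->nodeValue;
}

print_r($names);
print_r($links);
```

Each collected link could then be loaded with another `DOMDocument` and queried the same way.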

Answer 5

@zero: there is a good site for trying out scraping with both PHP and Python. It has been pretty helpful, at least to me: http://scraperwiki.com/

Answer 6

I'd use WWW::Mechanize:

http://search.cpan.org/dist/WWW-Mechanize/lib/WWW/Mechanize.pm


Answer 1
7 2011-05-28 06:30:27

Answer 2
2 accepted 2011-05-24 17:29:32

Answer 3
2 2011-05-26 20:32:25

Answer 4
1 2011-06-02 16:14:11

Answer 5
1 2011-06-02 17:49:16

Answer 6
-1 2011-06-02 01:26:34
