Retrieving specific data from a website

Question

I am currently building a scraper to extract certain information from a website.

For example, I would like to get a restaurant's name, address, opening hours, and telephone number from a website.

Using curl, I managed to fetch the page from the website:

    $url = "http://localhost/test.html";
    $ch = curl_init(); 
    curl_setopt($ch, CURLOPT_URL, $url); 
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1); 
    $data = curl_exec($ch); 
    curl_close($ch);

However, I need some ideas on how to point my scraper at the exact location of this information so that I can extract it.

I have tried regular expressions, but was unable to get them to work.

Answer 1

Use the SimpleHTMLDom parser for PHP:
http://simplehtmldom.sourceforge.net/

Download here:
http://sourceforge.net/projects/simplehtmldom/files/

Documentation here:
http://simplehtmldom.sourceforge.net/manual.htm

In my experience, it is the best tool for parsing HTML with PHP.

Also, you don't need curl to fetch the content unless you have a specific reason to use it; with the SimpleHTMLDom parser you can just use:

    $remote_html = file_get_html("http://www.somesite.com/");

Answer 2

Take a look at XPath querying: http://php.net/manual/en/domxpath.query.php

I use the equivalent method for website scraping in C#. The same standard applies here, and it works very well.
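As a rough illustration of the XPath approach, here is a minimal sketch using PHP's built-in DOMDocument and DOMXPath classes. The sample markup, its class names (`restaurant`, `name`, `address`, `hours`, `phone`), and the helper `extractRestaurant()` are all made up for this example; the real page will almost certainly need different XPath expressions. In practice, `$data` would be the string returned by `curl_exec()` in the question.

```php
<?php
// Hypothetical markup: these class names are assumptions for illustration only.
// Replace this string with the HTML returned by curl_exec().
$data = <<<EOT
<html><body>
  <div class="restaurant">
    <h1 class="name">Example Bistro</h1>
    <p class="address">1 Example Street</p>
    <p class="hours">Mon-Fri 10:00-22:00</p>
    <p class="phone">555-0100</p>
  </div>
</body></html>
EOT;

// Hypothetical helper: returns the text of the first node matching each query.
function extractRestaurant(string $html): array
{
    $doc = new DOMDocument();

    // Real-world HTML is often malformed; buffer libxml warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);

    // One XPath expression per field (assumed structure, adjust to the real page).
    $queries = [
        'name'    => '//div[@class="restaurant"]/h1[@class="name"]',
        'address' => '//div[@class="restaurant"]/p[@class="address"]',
        'hours'   => '//div[@class="restaurant"]/p[@class="hours"]',
        'phone'   => '//div[@class="restaurant"]/p[@class="phone"]',
    ];

    $result = [];
    foreach ($queries as $field => $query) {
        $nodes = $xpath->query($query);
        // query() returns a DOMNodeList, or false if the expression is invalid.
        $result[$field] = ($nodes !== false && $nodes->length > 0)
            ? trim($nodes->item(0)->textContent)
            : null;
    }

    return $result;
}

$restaurant = extractRestaurant($data);
print_r($restaurant);
```

A field comes back as `null` when its expression matches nothing, which makes it easier to notice when the page layout differs from what the queries assume.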

Answer 1: 3 (accepted), 2012-10-05 12:48:30

Answer 2: 1, 2012-10-05 12:49:22