Retrieving specific data from a website

I am currently building a scraper to extract certain information from a website.

For example, I would like to get a restaurant's name, address, opening hours and telephone number from a page.

Using curl, I managed to fetch the page:

    $url = "http://localhost/test.html";
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    // Return the response body as a string instead of printing it
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    $data = curl_exec($ch); // false on failure
    curl_close($ch);

However, I need some ideas on how to point my scraper at the exact location of each piece of information so I can extract it.

I have tried regular expressions, but was unable to get them to work.

Use the Simple HTML DOM parser for PHP:
http://simplehtmldom.sourceforge.net/

Download here:
http://sourceforge.net/projects/simplehtmldom/files/

Documentation here:
http://simplehtmldom.sourceforge.net/manual.htm

In my experience, it is the best tool for parsing HTML with PHP.

Also, you don't need curl to fetch the content unless you have a specific reason to use it. With the Simple HTML DOM parser, just use:

    $remote_html = file_get_html("http://www.somesite.com/");

Take a look at XPath querying: http://php.net/manual/en/domxpath.query.php

I use the equivalent method for website scraping in C#. The same standard applies here, and it works very well.
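
As a rough illustration of the XPath approach, here is a minimal sketch using PHP's built-in DOMDocument and DOMXPath classes. The question does not show the page's markup, so the sample HTML, the class names (`restaurant`, `name`, `address`, `hours`, `phone`) and the helper `firstText()` are all assumptions made up for this example. In practice `$data` would be the string returned by `curl_exec()`, and the queries would need adjusting to the real page structure.

```php
<?php
// Hypothetical markup standing in for the page fetched with curl;
// the real page's structure and class names will differ.
$data = <<<PAGE
<html><body>
  <div class="restaurant">
    <h1 class="name">Example Bistro</h1>
    <p class="address">1 Sample Street</p>
    <p class="hours">Mon-Fri 10:00-22:00</p>
    <p class="phone">555-0100</p>
  </div>
</body></html>
PAGE;

// Real-world HTML is rarely well formed, so collect parser warnings
// internally instead of printing them.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($data);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// Hypothetical helper: trimmed text of the first node matching $query,
// or null when nothing matches.
function firstText(DOMXPath $xpath, string $query): ?string
{
    $nodes = $xpath->query($query);
    if ($nodes === false || $nodes->length === 0) {
        return null;
    }
    return trim($nodes->item(0)->textContent);
}

// Each query pins down one element inside the (assumed) container div.
$base = '//div[@class="restaurant"]';
$restaurant = [
    'name'    => firstText($xpath, $base . '/h1[@class="name"]'),
    'address' => firstText($xpath, $base . '/p[@class="address"]'),
    'hours'   => firstText($xpath, $base . '/p[@class="hours"]'),
    'phone'   => firstText($xpath, $base . '/p[@class="phone"]'),
];

print_r($restaurant);
```

Returning null for a missing node makes it easy to spot when the site changes its layout and a query stops matching, which tends to be harder to notice with regular expressions.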
