
Extracting data from HTML using Simple HTML DOM Parser

For a college project, I am creating a website with some back-end algorithms, and to test these in a demo environment I need a lot of fake data. To get this data I intend to scrape some sites. One of these sites is freelancer.com. To extract the data I am using the Simple HTML DOM Parser, but so far I have been unable to get the data I need.

Here is an example of the HTML layout of the page I intend to scrape. The red boxes mark the required data.

[Screenshot of the HTML code on freelancer.com]

Here is the code I have written so far after following some tutorials.

<?php
include "simple_html_dom.php";
// Create DOM from URL
$html = file_get_html('http://www.freelancer.com/jobs/Website-Design/1/');

//Get all data inside the <tr> of <table id="project_table">
foreach($html->find('table[id=project_table] tr') as $tr) {

    foreach($tr->find('td[class=title-col]') as $t) {
        // Get the cell's outer HTML (the <td> tag itself included)
        $data = $t->outertext;
        echo $data;
    }
}

?>

Hopefully someone can point me in the right direction as to how I can get this working.

Thanks.

The raw source code is different from the markup in your screenshot, which is why you're not getting the expected results.

You can check the raw source code with Ctrl+U. The data is in table[id=project_table_static], and the td cells have no attributes, so here is working code that gets all the URLs from the table:

include "simple_html_dom.php";

$url = 'http://www.freelancer.com/jobs/Website-Design/1/';
// Create DOM from URL
$html = file_get_html($url);

// Stop early if the page could not be loaded
if (!$html) {
    die('Could not load ' . $url);
}

// Get every <tr> inside <table id="project_table_static">
foreach($html->find('table#project_table_static tbody tr') as $i=>$tr) {

    // Skip the first empty element
    if ($i==0) {
        continue;
    }

    echo "<br/>\$i=".$i;

    // Get the first anchor in the row, if there is one
    $anchor = $tr->find('a', 0);
    if ($anchor) {
        echo " => ".$anchor->href;
    }
}

// Clear dom object
$html->clear();
unset($html);

Demo
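If you want to try the selector logic without hitting the live site or installing Simple HTML DOM, the same idea can be sketched with PHP's built-in DOMDocument and DOMXPath classes. This swaps in a different parser from the one used above, and everything in the sample is an assumption for illustration: the markup in $rawHtml is a made-up stand-in for the raw page source, and extractRowLinks is a hypothetical helper name.

```php
<?php
// Hypothetical sample of the raw (server-sent) markup; the real page will differ.
$rawHtml = <<<HTML
<html><body>
<table id="project_table_static">
  <tbody>
    <tr><td></td></tr>
    <tr><td><a href="/projects/example-one.html">Example one</a></td><td>5</td></tr>
    <tr><td><a href="/projects/example-two.html">Example two</a></td><td>12</td></tr>
  </tbody>
</table>
</body></html>
HTML;

/**
 * Return the href of the first link in each row of the table with the given id.
 * Rows without a link (such as the empty first row) are skipped.
 * $tableId is placed directly into the XPath expression, so pass trusted values only.
 */
function extractRowLinks(string $html, string $tableId): array
{
    $doc = new DOMDocument();
    // Real-world HTML is rarely valid; collect parser warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $links = [];
    foreach ($xpath->query("//table[@id='$tableId']//tr") as $row) {
        // First anchor with an href inside this row, or null if there is none
        $anchor = $xpath->query('.//a[@href]', $row)->item(0);
        if ($anchor !== null) {
            $links[] = $anchor->getAttribute('href');
        }
    }
    return $links;
}

$links = extractRowLinks($rawHtml, 'project_table_static');
foreach ($links as $i => $href) {
    echo $i . ' => ' . $href . PHP_EOL;
}
```

Calling extractRowLinks($rawHtml, 'project_table') on this sample returns an empty array, which mirrors the original problem: a selector built from an id that is not present in the raw source matches nothing.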
