Scrape HTML table data and create XML or JSON doc

I need to scrape data from a table on a website and create an XML or JSON document that will be used by an app, and I am having trouble extracting the data below.

The table looks like this:

<table border="0" cellpadding="3" cellspacing="0" bgcolor="#DDEEFF" width="100%">
<tr>
<td width="20%" ><font face="Verdana, Arial" size="1">SRC</a></td></font>
    <td width="58%" ><font face="Verdana, Arial" size="1"><a href="http://example.com/this/news?id=1&by=today" onMouseOver="a('Open Bulletin');return true" onMouseOut="b()">Welcome</font></a></td>
<td width="17%" align="center"><font face="Verdana, Arial" size="1">Event</td></font>
    </tr>   

<tr>
<td width="20%" ><font face="Verdana, Arial" size="1">FMD</a></td></font>
    <td width="58%" ><font face="Verdana, Arial" size="1"><a href="http://example.com/this/news?id=2&by=today" onMouseOver="a('Open Bulletin');return true" onMouseOut="b()">Another News</font></a></td>
<td width="17%" align="center"><font face="Verdana, Arial" size="1">Updates</td></font>
    </tr>   
</td>

And I would like to create an XML feed or JSON that looks like this:

<bulletins>
    <title>Welcome</title>
    <id>1</id>
    <type>News</type>
</bulletins>

<bulletins>
    <title>Another News</title>
    <id>2</id>
    <type>Updates</type>
</bulletins>

Here is my current code:

<?php
$body = explode('<table border="0" cellpadding="3" cellspacing="0" bgcolor="#DDEEFF" width="100%">', $html);

$xml = simplexml_load_string("<?xml version='1.0' encoding='utf-8'?><xml />");

$rows = array();
foreach (array_slice(explode('<tr>', end($body)), 1) as $row)
{   
    preg_match('#<a.*?href="(.*?)".*?>(.*?)</a>#i', $row, $title);
    preg_match('/<a.*href="(.*)".*>(.*)<\/a>/iU', $row, $id);
   // preg_match('/type">([^<]+)<\/td>/', $row, $type); 

    $node = $xml->addChild('bulletins');

    $node->addChild('title', $title[1]);
    $node->addChild('id', $id[1]);
   // $node->addChild('type', $due[1]);
}

header('Content-Type: text/xml');
echo $xml->asXML();
?>

But the problem is that I get this instead:

<xml>
    <bulletins>
        <title>http://example.com/this/news?id=1</title>
        <id>http://example.com/this/news?id=1</id>
    </bulletins>
    <bulletins>
        <title>http://example.com/this/news?id=2</title>
        <id>http://example.com/this/news?id=2</id>
    </bulletins>
</xml>
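
For reference, the output above follows from how the capture groups are numbered: in the first pattern, group 1 is the href value and group 2 is the link text, so `$title[1]` holds the URL. The sketch below illustrates this on a single reduced row. The `$row` string is a simplified stand-in for one row of the table, and the second pattern for pulling the digits after `id=` is an assumption for illustration, not part of the original code.

```php
<?php
// Simplified stand-in for the link cell of one table row (stray </font> kept on purpose).
$row = '<td><font><a href="http://example.com/this/news?id=1&by=today" onMouseOver="a()">Welcome</font></a></td>';

// Group 1 captures the href value, group 2 captures everything up to </a>.
preg_match('#<a.*?href="(.*?)".*?>(.*?)</a>#i', $row, $m);
$href  = $m[1];
// The link text still contains the stray </font>, so strip any leftover tags.
$title = trim(strip_tags($m[2]));

// Hypothetical follow-up pattern: the digits after "id=" in the query string.
preg_match('/[?&]id=(\d+)/', $href, $idMatch);
$id = $idMatch[1];
```

With this split, `$title` and `$id` can be passed to `addChild()` separately instead of reusing index 1 of both matches.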

The main problem is that the HTML is invalid to begin with, so most PHP XML/HTML parsers will fail when trying to parse this particular HTML table. I used PHP Simple HTML DOM Parser to convert your table into JSON. This code assumes that table.html contains only the table you've specified above.

<?php
include 'simplehtmldom/simple_html_dom.php';

$html = file_get_html('table.html');

$bulletins = array();
$row_count = 0;

foreach($html->find('tr') as $row) {
  $row_count++;
  // The id here is simply the row position, not the id from the link
  $bulletins[$row_count]['id'] = $row_count;
  // Search the cells of the current row only, not of the whole document
  foreach($row->find('td') as $cell) {
    $cell_text = trim($cell->plaintext);
    // There are no CSS classes or IDs to differentiate between columns in the
    // table, so we're using the width instead
    switch ($cell->attr['width']) {
      case '58%':
        $bulletins[$row_count]['title'] = $cell_text;
        break;
      case '17%':
        $bulletins[$row_count]['type'] = $cell_text;
        break;
    }
  }
}

// Remove the invalid </a> tags from the cell text, and convert to JSON
$json = str_replace('<\/a>', '', json_encode($bulletins));

// Intended output for the sample table:
// {"1":{"id":1,"title":"Welcome","type":"Event"},"2":{"id":2,"title":"Another News","type":"Updates"}}
echo $json;

?>

Here's a quick example to get you started using only DOM functions:

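// Placeholder: replace with the address of the page that contains the table
$url = 'http://example.com/bulletins.html';

$dom = new DOMDocument();
// @ suppresses the warnings libxml raises for the invalid markup
@$dom->loadHTMLFile($url);
$xpath = new DOMXPath($dom);

$xml = new DOMDocument();
// A well-formed XML document has a single root, so wrap all bulletins in one element
$root = $xml->appendChild($xml->createElement("bulletins"));
foreach($xpath->query('//table/tr') as $tr) {
  $bulletin = $root->appendChild($xml->createElement("bulletin"));
  $title = $xpath->query('.//td[2]//a', $tr)->item(0)->nodeValue;
  // Text nodes are escaped on output, so titles containing & stay valid XML
  $bulletin->appendChild($xml->createElement("title"))->appendChild($xml->createTextNode($title));
  $type = $xpath->query('.//td[3]/font', $tr)->item(0)->nodeValue;
  $bulletin->appendChild($xml->createElement("type"))->appendChild($xml->createTextNode($type));
}
echo $xml->saveXML();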
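
The DOM example stops short of the `id` field the question asks for. One way to fill it in, sketched here under assumptions, is to read the link's `href` and take the `id` parameter from its query string with `parse_url` and `parse_str`. The function names `bulletinIdFromHref` and `bulletinsFromHtml` are hypothetical, and the sketch takes an HTML string rather than a URL so it can be tried without network access.

```php
<?php
// Hypothetical helper: return the "id" query parameter of a link, or null if absent.
function bulletinIdFromHref(string $href): ?string
{
    $query = parse_url($href, PHP_URL_QUERY);
    if (!is_string($query)) {
        return null;
    }
    parse_str($query, $params);
    return (isset($params['id']) && is_string($params['id'])) ? $params['id'] : null;
}

// Hypothetical helper: collect title, id and type for every table row that has a link.
function bulletinsFromHtml(string $html): array
{
    $dom = new DOMDocument();
    // Collect libxml warnings internally instead of printing them for messy markup.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    $bulletins = [];
    foreach ($xpath->query('//table//tr') as $tr) {
        $link = $xpath->query('.//td[2]//a', $tr)->item(0);
        $type = $xpath->query('.//td[3]', $tr)->item(0);
        // Skip rows that do not have the expected link and type cells.
        if (!($link instanceof DOMElement) || $type === null) {
            continue;
        }
        $bulletins[] = [
            'title' => trim($link->textContent),
            'id'    => bulletinIdFromHref($link->getAttribute('href')),
            'type'  => trim($type->textContent),
        ];
    }
    return $bulletins;
}
```

Passing the result to `json_encode()` gives a JSON document; the same array could instead be looped over to build the XML nodes shown above.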

I've written a script that will do this for you, assuming you have a valid HTML table: https://github.com/tremblay/HTML-Table-to-JSON

With a table formatted like that, you would run htmlToJSON('url.com', false, null, null, null, true);
