
Parse HTML table into array with PHP - get image attributes

I'm trying to parse a table from an HTML webpage, but I'm having trouble.

Here is roughly what my HTML looks like:

<tbody>

<tr class="even">
<td class="time">Monday 20:10</td>
<td class="place">Paris 14</td>
</tr>

<tr class="odd">
<td class="time">Monday 21:00</td>
<td class="place">Paris 13</td>
</tr>

</tbody>

EDIT: Here is my PHP:

<?php

$url = 'https://www.gymsuedoise.com/loc/dt/?id=64';


$options = array(
    CURLOPT_RETURNTRANSFER => true,     // return web page
    CURLOPT_HEADER         => false,    // don't return headers
    CURLOPT_FOLLOWLOCATION => true,     // follow redirects
    CURLOPT_ENCODING       => "",       // handle all encodings
    CURLOPT_USERAGENT      => "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:18.0) Gecko/20100101 Firefox/18.0", // something like Firefox 
    CURLOPT_AUTOREFERER    => true,     // set referer on redirect
    CURLOPT_CONNECTTIMEOUT => 120,      // timeout on connect
    CURLOPT_TIMEOUT        => 120,      // timeout on response
    CURLOPT_MAXREDIRS      => 10,       // stop after 10 redirects
);

$curl = curl_init($url);
curl_setopt_array($curl, $options);
$content = curl_exec($curl);
curl_close($curl);

$dom = new DOMDocument();
@$dom->loadHTML($content); // @ suppresses warnings from malformed HTML
$xpath = new DOMXPath($dom);

$tables = $dom->getElementsByTagName('tbody');
$rows = $tables->item(0)->getElementsByTagName('tr');

// Initialise the result array and the row counter before the loop
$liste_element = array();
$i = 0;

foreach ($rows as $row) 
{ 

$cols = $row->getElementsByTagName('td'); 

$date = $cols->item(0)->nodeValue; $liste_element[$i]['date'] = trim($date);
$intensite = $cols->item(2)->nodeValue; $liste_element[$i]['intensite'] = trim($intensite);
$animateur = $cols->item(3)->nodeValue; $liste_element[$i]['animateur'] = trim($animateur);
$forfait = $cols->item(5)->nodeValue; $liste_element[$i]['forfait'] = trim($forfait);

$i++;
} 

echo '<pre>';
print_r($liste_element);
echo '</pre>';

?>

My issue is that my script can't scrape anything from the 6th column (i.e. item(5)) of the table, as it contains only pictures and no text. How can I scrape the content of the alt or title attribute of the <img> tag?

The error that you are getting comes from printing/echoing an object. Also, you can't pass a URL to loadHTML(); it expects the HTML source as a string. You would need to do:

$fetchHtml = file_get_contents($url); // $url holds the address of the page
$dom->loadHTML($fetchHtml);           // returns true/false, not the markup

But doing it that way, you are likely to run into issues with the server you are trying to scrape.

What I did was use an open-source PHP package called Guzzle.

You will need to install it in your project directory using Composer. To install Composer, just run:

curl -sS https://getcomposer.org/installer | php

Then open the composer.json file and add the following:

{
   "require": {
      "guzzlehttp/guzzle": "~6.0"
   }
}

Then run:

composer update

This will fetch all the dependencies you need to run Guzzle.

If you are on shared hosting, download Guzzle and upload it to your server:

github.com/guzzle/guzzle/releases

The new file will look like this:

<?php
require 'vendor/autoload.php';

$client = new GuzzleHttp\Client();
$dom = new DOMDocument();
$url = 'https://www.gymsuedoise.com/loc/dt/?id=64';

$res = $client->request('GET', $url, [
    'auth' => ['user', 'pass']
]);


$html = (string)$res->getBody();


// Discard white space (a parser setting, so set it before loading)
$dom->preserveWhiteSpace = false;

// The @ in front of $dom suppresses warnings caused by malformed HTML
@$dom->loadHTML($html);

// Get the table bodies by tag name
$tables = $dom->getElementsByTagName('tbody');

// Get all rows from the first table body
$rows = $tables->item(0)->getElementsByTagName('tr');

// Loop over the table rows
foreach ($rows as $row)
{
    // Get each column by tag name
    $cols = $row->getElementsByTagName('td');

    // Echo the values
    echo $cols->item(0)->nodeValue.'<br />';
    echo $cols->item(1)->nodeValue.'<br />';
    echo $cols->item(2)->nodeValue;
}


?>

Keep in mind that this will only extract the first table from the HTML, because only item(0) of the <tbody> list is read.
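If the page holds more than one table, the same approach can be extended by looping over every <tbody> instead of reading only item(0). The following is a minimal sketch under stated assumptions: the two-table markup is made up for illustration (it is not the real page), and libxml_use_internal_errors() is used here in place of the @ operator to keep parser warnings quiet.

```php
<?php
// Hypothetical sample markup with two tables; not taken from the real page.
$html = <<<HTML
<table><tbody>
<tr><td class="time">Monday 20:10</td><td class="place">Paris 14</td></tr>
</tbody></table>
<table><tbody>
<tr><td class="time">Monday 21:00</td><td class="place">Paris 13</td></tr>
</tbody></table>
HTML;

$dom = new DOMDocument();
libxml_use_internal_errors(true); // collect parse warnings instead of printing them
$dom->loadHTML($html);
libxml_clear_errors();

$allRows = [];

// Visit every <tbody>, not only the first one
foreach ($dom->getElementsByTagName('tbody') as $tbody) {
    foreach ($tbody->getElementsByTagName('tr') as $row) {
        $cells = [];
        foreach ($row->getElementsByTagName('td') as $cell) {
            $cells[] = trim($cell->nodeValue);
        }
        $allRows[] = $cells;
    }
}

print_r($allRows);
```

Each entry of $allRows is one table row, stored as a plain list of trimmed cell texts.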

One way to do this is to call the method DOMElement::getAttribute() on the image. To traverse the nodes down to the level of the image, use the DOMNode::$firstChild property of the cell node, and then of the anchor tag. To ensure that $firstChild won't be NULL, use DOMNode::hasChildNodes().

if ($cols->item(5)->hasChildNodes()) {
    $anchor = $cols->item(5)->firstChild;
    if ($anchor->hasChildNodes()) {
        $altAttribute = $anchor->firstChild->getAttribute("alt"); 
        $liste_element[$i]['forfait'] = trim($altAttribute);
    }
}
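One caveat with firstChild: if the markup has line breaks or indentation between the <td>, the <a> and the <img>, the first child can be a whitespace text node, and a text node has no getAttribute() method. A possible alternative, sketched below, is to search the cell for <img> elements with getElementsByTagName(). The sample row, the class names and the attribute values are hypothetical, and the cell index is 1 here only because the sample has two columns (the real page would use item(5)).

```php
<?php
// Hypothetical row; the real page's markup may differ.
$html = <<<HTML
<table><tbody><tr>
<td class="time">Monday 20:10</td>
<td class="forfait">
    <a href="#"><img src="pass.png" alt="Pass A" title="Pass A (all sites)"></a>
</td>
</tr></tbody></table>
HTML;

$dom = new DOMDocument();
libxml_use_internal_errors(true); // keep parser warnings quiet
$dom->loadHTML($html);
libxml_clear_errors();

$row  = $dom->getElementsByTagName('tr')->item(0);
$cols = $row->getElementsByTagName('td');

// Search the whole cell, so whitespace text nodes and wrapper tags do not matter
$images = $cols->item(1)->getElementsByTagName('img');

$alt   = '';
$title = '';
if ($images->length > 0) {
    $img   = $images->item(0);
    $alt   = trim($img->getAttribute('alt'));   // '' if the attribute is missing
    $title = trim($img->getAttribute('title'));
}

echo $alt, ' | ', $title, PHP_EOL;
```

Inside the original loop, the same lookup would replace the nodeValue read for the image column, with $alt stored in $liste_element[$i]['forfait'].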

For a demonstration, see this playground example.
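Another option is to query the attribute directly with DOMXPath. This is only a sketch: the two-row markup is invented for illustration, and td[2] refers to the second column of that sample (the sixth column of the real table would be td[6]).

```php
<?php
// Hypothetical markup: the second row has no image in its last cell.
$html = <<<HTML
<table><tbody>
<tr><td>Monday 20:10</td><td><a href="#"><img src="a.png" alt="Pass A"></a></td></tr>
<tr><td>Monday 21:00</td><td></td></tr>
</tbody></table>
HTML;

$dom = new DOMDocument();
libxml_use_internal_errors(true); // keep parser warnings quiet
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
$alts  = [];

foreach ($xpath->query('//tbody/tr') as $row) {
    // string() yields '' when the cell contains no <img> with an alt attribute
    $alts[] = trim($xpath->evaluate('string(td[2]//img/@alt)', $row));
}

print_r($alts);
```

Because string() returns an empty string for an empty node set, rows without an image need no extra hasChildNodes() checks.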
