How do I handle missing XPath matches and keep the data aligned when scraping a website with the DOMXPath query method?

Question

I am trying to scrape a website using the DOMXPath query method. I have successfully scraped the 20 profile URLs, one per news anchor, from this page.

$url = "http://www.sandiego6.com/about-us/meet-our-team";
$xPath = "//p[@class='bio']/a/@href";

$html = new DOMDocument();
@$html->loadHtmlFile($url);
$xpath = new DOMXPath( $html );
$nodelist = $xpath->query($xPath);

$profileurl = array();
foreach ($nodelist as $n){
    $value = $n->nodeValue;
    $profileurl[] = $value;

    }

I use the resulting array as the list of URLs from which to scrape data on each news anchor's bio page.

$imgurl = array();
    for($z=0;$z<$elementCount;$z++){
        $html = new DOMDocument();
        @$html->loadHtmlFile($profileurl[$z]);
        $xpath = new DOMXPath($html);
        $nodelist = $xpath->query("//img[@class='photo fn']/@src");

        foreach($nodelist as $n){
            $value = $n->nodeValue;
            $imgurl[] = $value;
        }
    }

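Each news anchor profile page has 6 XPaths I need to scrape (the $imgurl array is one of them). I then send the scraped data to MySQL.

So far everything works well, except when I try to get the Twitter URL from each profile, because this element is not present on every news anchor's profile page. As a result, MySQL receives 5 columns with 20 full rows and 1 column (twitterurl) with only 18 rows of data. Those 18 rows are not aligned with the other data, because a page where the XPath does not match seems to be skipped.

How do I account for the missing XPath matches? While looking for an answer, I found someone's statement that "the nodeValue can never be null because without a value, the node wouldn't exist." With that in mind, if there is no nodeValue, how can I programmatically recognize when these XPaths do not match and fill that iteration with some default value before the loop moves on to the next iteration?

Here is the query for the Twitter URLs: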

$twitterurl = array();
    for($z=0;$z<$elementCount;$z++){
        $html = new DOMDocument();
        @$html->loadHtmlFile($profileurl[$z]);
        $xpath = new DOMXPath($html);
        $nodelist = $xpath->query("//*[@id='bio']/div[2]/p[3]/a/@href");

        foreach($nodelist as $n){
            $value = $n->nodeValue;
            $twitterurl[] = $value;
        }
    }

Answer 1

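Since the Twitter node appears either 0 or 1 times per page, change the foreach to

$twitterurl[] = $nodelist->length ? $nodelist->item(0)->nodeValue : NULL;

That keeps the contents in sync. You will, however, have to make arrangements to handle NULL values in the query you use to insert the rows into the database.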
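
To see the effect without hitting the real site, here is a minimal sketch that runs the same length check against two made-up inline pages. The markup, the "tw" class and the XPath are assumptions for illustration only, not the structure of the actual profile pages.

```php
<?php
// Hypothetical inline pages standing in for two profile pages:
// the first has a Twitter link, the second does not.
$pages = array(
    '<div id="bio"><p class="tw"><a href="http://twitter.com/a">tw</a></p></div>',
    '<div id="bio"><p>no twitter link here</p></div>',
);

$twitterurl = array();
foreach ($pages as $page) {
    $html = new DOMDocument();
    $html->loadHTML($page);
    $xpath    = new DOMXPath($html);
    $nodelist = $xpath->query("//*[@id='bio']/p[@class='tw']/a/@href");

    // Exactly one entry per page: the value if present, NULL otherwise.
    $twitterurl[] = $nodelist->length ? $nodelist->item(0)->nodeValue : NULL;
}

var_dump($twitterurl);
```

The array ends up with one entry per page, and the second entry is NULL, so the index of each entry still matches the index of the page it came from.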

Answer 2

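I think you have several issues with the way you collect your data. I will try to outline them in my answer, in the hope that this also clarifies your central question:

"I found someone's statement that 'the nodeValue can never be null because without a value, the node wouldn't exist.' With that in mind, if there is no nodeValue, how can I programmatically recognize when these XPaths do not match and fill that iteration with some default value before the loop moves on to the next iteration?"

Collecting the URL of each profile (detail) page first is a good idea. You can benefit from it even more by placing it in the overall context of your scraping job: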

* profile pages
     `- profile page
          +- name
          +- role
          +- img
          +- email
          +- facebook
          `- twitter

That is the structure of the data you want to obtain. You have already managed to get the URLs of all the profile pages:

$url   = "http://www.sandiego6.com/about-us/meet-our-team";
$xPath = "//p[@class='bio']/a/@href";

$html = new DOMDocument();
@$html->loadHtmlFile($url);
$xpath    = new DOMXPath($html);
$nodelist = $xpath->query($xPath);

$profileurl = array();
foreach ($nodelist as $n) {
    $value        = $n->nodeValue;
    $profileurl[] = $value;
}

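As you know, the next step is to load and query the 20+ profile pages. The first thing you can do is extract the part of your code that creates a DOMXPath from a URL into a function of its own. This also makes better error handling easy: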

/**
 * @param string $url
 *
 * @throws RuntimeException
 * @return DOMXPath
 */
function xpath_from_url($url)
{
    $html   = new DOMDocument();
    $saved  = libxml_use_internal_errors(true);
    $result = $html->loadHtmlFile($url);
    libxml_use_internal_errors($saved);
    if (!$result) {
        throw new RuntimeException(sprintf('Failed to load HTML from "%s"', $url));
    }
    $xpath = new DOMXPath($html);
    return $xpath;
}

Just by extracting (moving) that code into the xpath_from_url function, your main processing shrinks to a more compact form:

$xpath    = xpath_from_url($url);
$nodelist = $xpath->query($xPath);

$profileurl = array();
foreach ($nodelist as $n) {
    $value        = $n->nodeValue;
    $profileurl[] = $value;
}

But it also allows another change to the code: you can now process the URLs directly within the structure of your main extraction routine:

$url = "http://www.sandiego6.com/about-us/meet-our-team";

$xpath       = xpath_from_url($url);
$profileUrls = $xpath->query("//p[@class='bio']/a/@href");
foreach ($profileUrls as $profileUrl) {
    $profile = xpath_from_url($profileUrl->nodeValue);
    // ... extract the six (incl. optional) values from a profile
}

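As you can see, this code skips building the array of profile URLs, because the first XPath operation already gives you a collection of all the profile URLs.

What is still missing is the part that extracts up to six fields from the detail page. With this new way of iterating over the profile URLs, that is easy to manage: create one XPath expression per field and fetch the data. If you use DOMXPath::evaluate instead of DOMXPath::query, you can get string values directly. The string value of a non-existent node is an empty string. Note that this does not actually test whether the node exists; if you need NULL instead of "" (empty string), that has to be done differently (I can show that as well, but it is not the point right now). In the following example, the anchor's name and role are extracted: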

foreach ($profileUrls as $i => $profileUrl) {
    $profile = xpath_from_url($profileUrl->nodeValue);
    printf(
        "#%02d: %s (%s)\n", $i + 1,
        $profile->evaluate('normalize-space(//h1[@class="entry-title"])'),
        $profile->evaluate('normalize-space(//h2[@class="fn"])')
    );
    // ... extract the other four (incl. optional) values from a profile
}

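I chose to output the values directly (without bothering to add them to an array or a similar structure), so it is easy to follow what happens: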

#01: Marc Bailey (Morning Anchor)
#02: Heather Myers (Morning Anchor)
#03: Jim Patton (10pm Anchor)
#04: Neda Iranpour (10 p.m. Anchor / Reporter)
...
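
For the case where NULL is needed instead of an empty string, one possible approach is to go back to DOMXPath::query and check the length of the result. The following is only a sketch: the helper name xpath_string_or_null and the inline markup are made up for illustration and are not part of the original code.

```php
<?php
// Hypothetical markup standing in for a profile page without a Twitter link.
$html = new DOMDocument();
$html->loadHTML('<div><h1 class="entry-title">  Jane   Doe </h1></div>');
$profile = new DOMXPath($html);

/**
 * Hypothetical helper: trimmed string value of the first matching node,
 * or NULL when the expression matches nothing (or is invalid).
 *
 * @param DOMXPath $xpath
 * @param string   $expression
 *
 * @return string|null
 */
function xpath_string_or_null(DOMXPath $xpath, $expression)
{
    $nodes = $xpath->query($expression);
    if ($nodes === false || $nodes->length === 0) {
        return null;
    }
    return trim($nodes->item(0)->nodeValue);
}

// evaluate() with a string function always returns a string ...
$name  = $profile->evaluate('normalize-space(//h1[@class="entry-title"])');
$empty = $profile->evaluate('string(//*[@class="bio-twitter url"]/a/@href)');

// ... while the helper distinguishes "no such node" from "empty value".
$twitter = xpath_string_or_null($profile, '//*[@class="bio-twitter url"]/a/@href');

var_dump($name, $empty, $twitter);
```

Here $empty is an empty string while $twitter is NULL, which is the distinction the database column needs if it should store NULL for anchors without a Twitter account.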

Fetching the details for email, Facebook and Twitter works the same way:

foreach ($profileUrls as $i => $profileUrl) {
    $profile = xpath_from_url($profileUrl->nodeValue);
    printf(
        "#%02d: %s (%s)\n", $i + 1,
        $profile->evaluate('normalize-space(//h1[@class="entry-title"])'),
        $profile->evaluate('normalize-space(//h2[@class="fn"])')
    );
    printf(
        "  email...: %s\n",
        $profile->evaluate('substring-after(//*[@class="bio-email"]/a/@href, ":")')
    );
    printf(
        "  facebook: %s\n",
        $profile->evaluate('string(//*[@class="bio-facebook url"]/a/@href)')
    );
    printf(
        "  twitter.: %s\n",
        $profile->evaluate('string(//*[@class="bio-twitter url"]/a/@href)')
    );
}

This already outputs the data as needed (I left out the images because they do not display well in text mode):

#01: Marc Bailey (Morning Anchor)
  email...: m.bailey@sandiego6.com
  facebook: https://www.facebook.com/marc.baileySD6
  twitter.: http://www.twitter.com/MarcBaileySD6
#02: Heather Myers (Morning Anchor)
  email...: heather.myers@sandiego6.com
  facebook: https://www.facebook.com/heather.myersSD6
  twitter.: http://www.twitter.com/HeatherMyersSD6
#03: Jim Patton (10pm Anchor)
  email...: jim.patton@sandiego6.com
  facebook: https://www.facebook.com/Jim.PattonSD6
  twitter.: http://www.twitter.com/JimPattonSD6
#04: Neda Iranpour (10 p.m. Anchor / Reporter)
  email...: Neda.Iranpour@sandiego6.com
  facebook: https://www.facebook.com/lightenupwithneda
  twitter.: http://www.twitter.com/@LightenUpWNeda
...

So these few lines of code with a single foreach loop already represent the originally outlined structure quite well:

* profile pages
     `- profile page
          +- name
          +- role
          +- img
          +- email
          +- facebook
          `- twitter

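All you have to do is follow the overall structure of the data in your code. Then, once you see that all the data can be obtained as needed, you perform the store operation in the database: one insert per profile, that is, one row per profile. You do not have to keep the whole data set around; you just insert the data for each row (perhaps after checking whether it already exists).

Hope that helps.

Appendix: full code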
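
The "one row per profile" idea could be sketched as follows. The helper name profile_row, the NULL conversion and the sample markup are assumptions for illustration; the image expression reuses the class from the question, the other expressions are the ones shown above. The actual insert is left out, since it depends on your database layer.

```php
<?php
/**
 * Hypothetical helper: collect the six fields of one profile into one row.
 * Fields whose node does not exist (empty string) are turned into NULL.
 *
 * @param DOMXPath $profile
 *
 * @return array
 */
function profile_row(DOMXPath $profile)
{
    $row = array(
        'name'     => $profile->evaluate('normalize-space(//h1[@class="entry-title"])'),
        'role'     => $profile->evaluate('normalize-space(//h2[@class="fn"])'),
        'img'      => $profile->evaluate('string(//img[@class="photo fn"]/@src)'),
        'email'    => $profile->evaluate('substring-after(//*[@class="bio-email"]/a/@href, ":")'),
        'facebook' => $profile->evaluate('string(//*[@class="bio-facebook url"]/a/@href)'),
        'twitter'  => $profile->evaluate('string(//*[@class="bio-twitter url"]/a/@href)'),
    );
    foreach ($row as $key => $value) {
        if ($value === '') {
            $row[$key] = null;
        }
    }
    return $row;
}

// Made-up sample page without Facebook or Twitter links.
$html = new DOMDocument();
$html->loadHTML(
    '<h1 class="entry-title">Jane Doe</h1><h2 class="fn">Reporter</h2>'
    . '<img class="photo fn" src="/img/jane.jpg">'
    . '<p class="bio-email"><a href="mailto:jane@example.com">mail</a></p>'
);
$row = profile_row(new DOMXPath($html));

var_dump($row);
```

Because every row always has the same six keys, a missing Twitter link can no longer shift the columns: the row simply carries NULL in that position, and it can be handed to a prepared insert statement as is.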

<?php
/**
 * Scraping detail pages based on index page
 */

/**
 * @param string $url
 *
 * @throws RuntimeException
 * @return DOMXPath
 */
function xpath_from_url($url)
{
    $html   = new DOMDocument();
    $saved  = libxml_use_internal_errors(true);
    $result = $html->loadHtmlFile($url);
    libxml_use_internal_errors($saved);
    if (!$result) {
        throw new RuntimeException(sprintf('Failed to load HTML from "%s"', $url));
    }
    $xpath = new DOMXPath($html);
    return $xpath;
}

$url = "http://www.sandiego6.com/about-us/meet-our-team";

$xpath       = xpath_from_url($url);
$profileUrls = $xpath->query("//p[@class='bio']/a/@href");
foreach ($profileUrls as $i => $profileUrl) {
    $profile = xpath_from_url($profileUrl->nodeValue);
    printf(
        "#%02d: %s (%s)\n", $i + 1, $profile->evaluate('normalize-space(//h1[@class="entry-title"])'),
        $profile->evaluate('normalize-space(//h2[@class="fn"])')
    );
    printf("  email...: %s\n", $profile->evaluate('substring-after(//*[@class="bio-email"]/a/@href, ":")'));
    printf("  facebook: %s\n", $profile->evaluate('string(//*[@class="bio-facebook url"]/a/@href)'));
    printf("  twitter.: %s\n", $profile->evaluate('string(//*[@class="bio-twitter url"]/a/@href)'));
}
