
How do I account for missing xPaths and keep my data uniform when scraping a website using the DOMXPath query method?

I am attempting to scrape a website using the DOMXPath query method. I have successfully scraped the profile URLs of the 20 news anchors from this page.

$url = "http://www.sandiego6.com/about-us/meet-our-team";
$xPath = "//p[@class='bio']/a/@href";

$html = new DOMDocument();
@$html->loadHtmlFile($url);
$xpath = new DOMXPath( $html );
$nodelist = $xpath->query($xPath);

$profileurl = array();
foreach ($nodelist as $n) {
    $value = $n->nodeValue;
    $profileurl[] = $value;
}

I then used the resulting array as the list of URLs from which to scrape data on each news anchor's bio page.

$imgurl = array();
$elementCount = count($profileurl); // one iteration per profile page
for ($z = 0; $z < $elementCount; $z++) {
    $html = new DOMDocument();
    @$html->loadHtmlFile($profileurl[$z]);
    $xpath = new DOMXPath($html);
    $nodelist = $xpath->query("//img[@class='photo fn']/@src");

    foreach ($nodelist as $n) {
        $value = $n->nodeValue;
        $imgurl[] = $value;
    }
}

Each news anchor profile page has 6 xPaths I need to scrape (the $imgurl array is one of them). I am then sending this scraped data to MySQL.

So far, everything works great, except when I attempt to get the Twitter URL from each profile, because this element isn't found on every news anchor profile page. This results in MySQL receiving 5 columns with 20 full rows and 1 column (twitterurl) with only 18 rows of data. Those 18 rows are not lined up correctly with the other data because, if the xPath doesn't exist, it seems to be skipped.

How do I account for missing xPaths? Looking for an answer, I found someone's statement that said, "The nodeValue can never be null because without a value, the node wouldn't exist." That being considered, if there is no nodeValue, how can I programmatically recognize when these xPaths don't exist and fill that iteration with some other default value before it loops through to the next iteration?

Here's the query for the Twitter URLs:

$twitterurl = array();
for ($z = 0; $z < $elementCount; $z++) {
    $html = new DOMDocument();
    @$html->loadHtmlFile($profileurl[$z]);
    $xpath = new DOMXPath($html);
    $nodelist = $xpath->query("//*[@id='bio']/div[2]/p[3]/a/@href");

    foreach ($nodelist as $n) {
        $value = $n->nodeValue;
        $twitterurl[] = $value;
    }
}

Since the Twitter node appears either zero times or once, replace the foreach loop with:

$twitterurl[] = $nodelist->length ? $nodelist->item(0)->nodeValue : NULL;

That appends exactly one entry per profile, so the arrays stay in sync. You will, however, have to make arrangements to handle NULL values in the query you use to insert them into the database.
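To illustrate the idea, here is a minimal sketch that wraps the length check in a helper. The function name first_value_or_default, the inline HTML fragments and the class name "tw" are made up for this example and are not taken from the scraped site; they only stand in for two profile pages, one with and one without a Twitter link.

```php
<?php
// Hypothetical helper: value of the first matching node, or a default
// when the expression matches nothing (or is invalid).
function first_value_or_default(DOMXPath $xpath, $expression, $default = null)
{
    $nodelist = $xpath->query($expression);
    if ($nodelist === false || $nodelist->length === 0) {
        return $default;
    }
    return $nodelist->item(0)->nodeValue;
}

// Two made-up profile pages; the second one has no Twitter link.
$pages = array(
    '<html><body><div id="bio"><a class="tw" href="http://twitter.example/anchor1">Twitter</a></div></body></html>',
    '<html><body><div id="bio"><p>No social links</p></div></body></html>',
);

$twitterurl = array();
foreach ($pages as $markup) {
    $html = new DOMDocument();
    $html->loadHTML($markup);
    $xpath = new DOMXPath($html);
    // Exactly one entry is appended per page, so the indexes stay aligned
    // with the other per-profile arrays.
    $twitterurl[] = first_value_or_default($xpath, "//a[@class='tw']/@href");
}

var_dump($twitterurl);
```

Any other placeholder (an empty string, for instance) can be passed as the third argument if NULL is inconvenient for the insert query.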

I think there are multiple issues with the way you scrape the data. I will try to outline them in my answer in the hope that this also clarifies your central question:

"I found someone's statement that said, 'The nodeValue can never be null because without a value, the node wouldn't exist.' That being considered, if there is no nodeValue, how can I programmatically recognize when these xPaths don't exist and fill that iteration with some other default value before it loops through to the next iteration?"

First of all, collecting the URLs of each profile (detail) page is a good idea. You can benefit from it even more by putting it into the overall context of your scraping job:

* profile pages
     `- profile page
          +- name
          +- role
          +- img
          +- email
          +- facebook
          `- twitter

This is the structure of the data you would like to obtain. You have already managed to obtain the URLs of all profile pages:

$url   = "http://www.sandiego6.com/about-us/meet-our-team";
$xPath = "//p[@class='bio']/a/@href";

$html = new DOMDocument();
@$html->loadHtmlFile($url);
$xpath    = new DOMXPath($html);
$nodelist = $xpath->query($xPath);

$profileurl = array();
foreach ($nodelist as $n) {
    $value        = $n->nodeValue;
    $profileurl[] = $value;
}

Since you know that the next step is to load and query the 20+ profile pages, one of the very first things you could do is extract the part of your code that creates a DOMXPath from a URL into a function of its own. This also makes better error handling easy:

/**
 * @param string $url
 *
 * @throws RuntimeException
 * @return DOMXPath
 */
function xpath_from_url($url)
{
    $html   = new DOMDocument();
    $saved  = libxml_use_internal_errors(true);
    $result = $html->loadHtmlFile($url);
    libxml_use_internal_errors($saved);
    if (!$result) {
        throw new RuntimeException(sprintf('Failed to load HTML from "%s"', $url));
    }
    $xpath = new DOMXPath($html);
    return $xpath;
}

Merely moving that code into the xpath_from_url function already makes the main processing more compact:

$xpath    = xpath_from_url($url);
$nodelist = $xpath->query($xPath);

$profileurl = array();
foreach ($nodelist as $n) {
    $value        = $n->nodeValue;
    $profileurl[] = $value;
}

It also allows another change to the code: you can now process the URLs directly within the structure of your main extraction routine:

$url = "http://www.sandiego6.com/about-us/meet-our-team";

$xpath       = xpath_from_url($url);
$profileUrls = $xpath->query("//p[@class='bio']/a/@href");
foreach ($profileUrls as $profileUrl) {
    $profile = xpath_from_url($profileUrl->nodeValue);
    // ... extract the six values (incl. optional ones) from a profile
}

As you can see, this code skips creating the array of profile URLs, because the first XPath operation already provides a collection of all profile URLs.

What is still missing is the part that extracts the up to six fields from the detail page. With this new way of iterating over the profile URLs, that is pretty easy to manage: just create one XPath expression for each field and fetch the data. If you use DOMXPath::evaluate instead of DOMXPath::query, you can get string values directly. The string value of a non-existent node is an empty string. This does not really test whether the node exists; if you need NULL instead of "" (empty string), it has to be done differently (I can show that, too, but that's not the point right now). In the following example the anchor's name and role are extracted:

foreach ($profileUrls as $i => $profileUrl) {
    $profile = xpath_from_url($profileUrl->nodeValue);
    printf(
        "#%02d: %s (%s)\n", $i + 1,
        $profile->evaluate('normalize-space(//h1[@class="entry-title"])'),
        $profile->evaluate('normalize-space(//h2[@class="fn"])')
    );
    // ... extract the other four values (incl. optional ones) from a profile
}

I chose to output the values directly (rather than adding them to an array or a similar structure) so that it's easy to follow what happens:

#01: Marc Bailey (Morning Anchor)
#02: Heather Myers (Morning Anchor)
#03: Jim Patton (10pm Anchor)
#04: Neda Iranpour (10 p.m. Anchor / Reporter)
...

Fetching the email, Facebook and Twitter details works the same way as for the name and role:

foreach ($profileUrls as $i => $profileUrl) {
    $profile = xpath_from_url($profileUrl->nodeValue);
    printf(
        "#%02d: %s (%s)\n", $i + 1,
        $profile->evaluate('normalize-space(//h1[@class="entry-title"])'),
        $profile->evaluate('normalize-space(//h2[@class="fn"])')
    );
    printf(
        "  email...: %s\n",
        $profile->evaluate('substring-after(//*[@class="bio-email"]/a/@href, ":")')
    );
    printf(
        "  facebook: %s\n",
        $profile->evaluate('string(//*[@class="bio-facebook url"]/a/@href)')
    );
    printf(
        "  twitter.: %s\n",
        $profile->evaluate('string(//*[@class="bio-twitter url"]/a/@href)')
    );
}

This already outputs the data as you need it (I've left the images out because they can't be displayed well in text mode):

#01: Marc Bailey (Morning Anchor)
  email...: m.bailey@sandiego6.com
  facebook: https://www.facebook.com/marc.baileySD6
  twitter.: http://www.twitter.com/MarcBaileySD6
#02: Heather Myers (Morning Anchor)
  email...: heather.myers@sandiego6.com
  facebook: https://www.facebook.com/heather.myersSD6
  twitter.: http://www.twitter.com/HeatherMyersSD6
#03: Jim Patton (10pm Anchor)
  email...: jim.patton@sandiego6.com
  facebook: https://www.facebook.com/Jim.PattonSD6
  twitter.: http://www.twitter.com/JimPattonSD6
#04: Neda Iranpour (10 p.m. Anchor / Reporter)
  email...: Neda.Iranpour@sandiego6.com
  facebook: https://www.facebook.com/lightenupwithneda
  twitter.: http://www.twitter.com/@LightenUpWNeda
...
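As mentioned earlier, the empty string returned by evaluate() does not tell you whether the node exists. One possible way to get NULL for a missing node, sketched here under assumptions, is to ask XPath for boolean(...) first, which is true only for a non-empty node set. The helper name string_or_null and the inline HTML are invented for this illustration; only the two class names are taken from the expressions above.

```php
<?php
// Hypothetical helper: string value of the first match, or NULL when the
// expression selects no node at all.
function string_or_null(DOMXPath $xpath, $expression)
{
    // boolean() of a node set is true only if the set is non-empty.
    if (!$xpath->evaluate("boolean($expression)")) {
        return null;
    }
    return $xpath->evaluate("string($expression)");
}

// Made-up profile fragment: a Facebook link with an empty href,
// and no Twitter link at all.
$html = new DOMDocument();
$html->loadHTML(
    '<html><body>'
    . '<p class="bio-facebook url"><a href="">Facebook</a></p>'
    . '</body></html>'
);
$profile = new DOMXPath($html);

$facebook = '//*[@class="bio-facebook url"]/a/@href';
$twitter  = '//*[@class="bio-twitter url"]/a/@href';

// Plain string() cannot tell the two cases apart.
var_dump($profile->evaluate("string($facebook)"));
var_dump($profile->evaluate("string($twitter)"));

// The helper separates "exists but empty" from "does not exist".
var_dump(string_or_null($profile, $facebook));
var_dump(string_or_null($profile, $twitter));
```

The price is a second XPath evaluation per field, which is negligible next to the cost of fetching each page.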

So these few lines of code, with a single foreach loop over the profile URLs, already represent the structure outlined at the start fairly well:

* profile pages
     `- profile page
          +- name
          +- role
          +- img
          +- email
          +- facebook
          `- twitter

All you have to do is make your code follow the overall structure in which the data is available. Then, at the end, once you see that all data can be obtained as wished, you do the store operation in the database: one insert per profile, that is, one row per profile. You don't have to keep the whole data set in memory; you can just insert the data for each row (perhaps after checking whether it already exists).
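A rough sketch of what "one row per profile" could look like follows. The function names profile_row and insert_profile, the table name anchors and its column names are assumptions for illustration; the real schema is not shown in the question. The XPath expressions are the ones used above, plus the image expression from the question. Note that this sketch maps every empty string to NULL, so it does not distinguish an empty field from a missing one.

```php
<?php
/**
 * Hypothetical helper: collect all fields of one profile as a single row.
 * Empty results become NULL, so every row has the same six keys.
 */
function profile_row(DOMXPath $profile)
{
    $fields = array(
        'name'     => 'normalize-space(//h1[@class="entry-title"])',
        'role'     => 'normalize-space(//h2[@class="fn"])',
        'img'      => 'string(//img[@class="photo fn"]/@src)',
        'email'    => 'substring-after(//*[@class="bio-email"]/a/@href, ":")',
        'facebook' => 'string(//*[@class="bio-facebook url"]/a/@href)',
        'twitter'  => 'string(//*[@class="bio-twitter url"]/a/@href)',
    );
    $row = array();
    foreach ($fields as $column => $expression) {
        $value        = $profile->evaluate($expression);
        $row[$column] = ($value === '') ? null : $value;
    }
    return $row;
}

/**
 * Hypothetical insert using PDO; assumes a table "anchors" with one
 * column per field. A PHP null in $row is sent as SQL NULL.
 */
function insert_profile(PDO $pdo, array $row)
{
    $statement = $pdo->prepare(
        'INSERT INTO anchors (name, role, img, email, facebook, twitter)
         VALUES (:name, :role, :img, :email, :facebook, :twitter)'
    );
    $statement->execute($row);
}

// Made-up profile page without image, Facebook or Twitter entries.
$html = new DOMDocument();
$html->loadHTML(
    '<html><body><h1 class="entry-title">Jane Doe</h1>'
    . '<h2 class="fn">Reporter</h2>'
    . '<p class="bio-email"><a href="mailto:jane@example.com">Email</a></p>'
    . '</body></html>'
);
$row = profile_row(new DOMXPath($html));
var_dump($row);
```

Inside the main loop this would become insert_profile($pdo, profile_row($profile)), with $pdo being an existing PDO connection to the MySQL database.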

Hope that helps.


Appendix: Code in full

<?php
/**
 * Scraping detail pages based on index page
 */

/**
 * @param string $url
 *
 * @throws RuntimeException
 * @return DOMXPath
 */
function xpath_from_url($url)
{
    $html   = new DOMDocument();
    $saved  = libxml_use_internal_errors(true);
    $result = $html->loadHtmlFile($url);
    libxml_use_internal_errors($saved);
    if (!$result) {
        throw new RuntimeException(sprintf('Failed to load HTML from "%s"', $url));
    }
    $xpath = new DOMXPath($html);
    return $xpath;
}

$url = "http://www.sandiego6.com/about-us/meet-our-team";

$xpath       = xpath_from_url($url);
$profileUrls = $xpath->query("//p[@class='bio']/a/@href");
foreach ($profileUrls as $i => $profileUrl) {
    $profile = xpath_from_url($profileUrl->nodeValue);
    printf(
        "#%02d: %s (%s)\n", $i + 1, $profile->evaluate('normalize-space(//h1[@class="entry-title"])'),
        $profile->evaluate('normalize-space(//h2[@class="fn"])')
    );
    printf("  email...: %s\n", $profile->evaluate('substring-after(//*[@class="bio-email"]/a/@href, ":")'));
    printf("  facebook: %s\n", $profile->evaluate('string(//*[@class="bio-facebook url"]/a/@href)'));
    printf("  twitter.: %s\n", $profile->evaluate('string(//*[@class="bio-twitter url"]/a/@href)'));
}
