How do I account for missing xPaths and keep my data uniform when scraping a website using the DOMXPath query method?

I am attempting to scrape a website using the DOMXPath query method. I have successfully scraped the 20 profile URLs (one for each News Anchor) from this page.

$url = "http://www.sandiego6.com/about-us/meet-our-team";
$xPath = "//p[@class='bio']/a/@href";

$html = new DOMDocument();
@$html->loadHtmlFile($url);
$xpath = new DOMXPath( $html );
$nodelist = $xpath->query($xPath);

$profileurl = array();
foreach ($nodelist as $n) {
    $value = $n->nodeValue;
    $profileurl[] = $value;
}

I used the resulting array as the URLs to scrape data from each News Anchor's bio page ($elementCount below is the number of profile URLs collected above).

$imgurl = array();
for ($z = 0; $z < $elementCount; $z++) {
    $html = new DOMDocument();
    @$html->loadHtmlFile($profileurl[$z]);
    $xpath = new DOMXPath($html);
    $nodelist = $xpath->query("//img[@class='photo fn']/@src");

    foreach ($nodelist as $n) {
        $value = $n->nodeValue;
        $imgurl[] = $value;
    }
}

Each News Anchor profile page has 6 xPaths I need to scrape (the $imgurl array is one of them). I am then sending this scraped data to MySQL.

So far, everything works great - except when I attempt to get the Twitter URL from each profile because this element isn't found on every News Anchor profile page. This results in MySQL receiving 5 columns with 20 full rows and 1 column (twitterurl) with 18 rows of data. Those 18 rows are not lined up with the other data correctly because if the xPath doesn't exist, it seems to be skipped.

How do I account for missing xPaths? Looking for an answer, I found someone's statement that said, "The nodeValue can never be null because without a value, the node wouldn't exist." That being considered, if there is no nodeValue, how can I programmatically recognize when these xPaths don't exist and fill that iteration with some other default value before it loops through to the next iteration?

Here's the query for the Twitter URLs:

$twitterurl = array();
for ($z = 0; $z < $elementCount; $z++) {
    $html = new DOMDocument();
    @$html->loadHtmlFile($profileurl[$z]);
    $xpath = new DOMXPath($html);
    $nodelist = $xpath->query("//*[@id='bio']/div[2]/p[3]/a/@href");

    foreach ($nodelist as $n) {
        $value = $n->nodeValue;
        $twitterurl[] = $value;
    }
}

Since the Twitter node appears at most once per page, change the foreach to:

$twitterurl[] = $nodelist->length ? $nodelist->item(0)->nodeValue : NULL;

That will keep the contents in sync. You will, however, have to make arrangements to handle NULL values in the query you use to insert them in the database.
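For example, with PDO prepared statements a PHP null bound as a parameter is stored as SQL NULL, so the rows stay aligned even when the Twitter URL is missing. A minimal sketch, assuming a hypothetical anchors table and connection details (adjust both to your schema):

$pdo  = new PDO('mysql:host=localhost;dbname=scrape;charset=utf8', 'user', 'pass');
$stmt = $pdo->prepare(
    'INSERT INTO anchors (profileurl, imgurl, twitterurl) VALUES (?, ?, ?)'
);

for ($z = 0; $z < $elementCount; $z++) {
    // a PHP null among the bound values becomes SQL NULL,
    // so missing Twitter URLs no longer shift the other columns
    $stmt->execute(array($profileurl[$z], $imgurl[$z], $twitterurl[$z]));
}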

I think you have multiple issues in the way you scrape the data, and I will try to outline them in my answer in the hope that it also clarifies your central question:

I found someone's statement that said, "The nodeValue can never be null because without a value, the node wouldn't exist." That being considered, if there is no nodeValue, how can I programmatically recognize when these xPaths don't exist and fill that iteration with some other default value before it loops through to the next iteration?

First of all, collecting the URLs of each profile (detail) page is a good idea. You can benefit from it even more by putting it into the overall context of your scraping job:

* profile pages
     `- profile page
          +- name
          +- role
          +- img
          +- email
          +- facebook
          `- twitter

This is the structure of the data you would like to obtain. You have already managed to obtain all the profile page URLs:

$url   = "http://www.sandiego6.com/about-us/meet-our-team";
$xPath = "//p[@class='bio']/a/@href";

$html = new DOMDocument();
@$html->loadHtmlFile($url);
$xpath    = new DOMXPath($html);
$nodelist = $xpath->query($xPath);

$profileurl = array();
foreach ($nodelist as $n) {
    $value        = $n->nodeValue;
    $profileurl[] = $value;
}

Since the next step is to load and query the 20+ profile pages, one of the very first things you could do is extract the part of your code that creates a DOMXPath from a URL into a function of its own. This will also make better error handling easy:

/**
 * @param string $url
 *
 * @throws RuntimeException
 * @return DOMXPath
 */
function xpath_from_url($url)
{
    $html   = new DOMDocument();
    $saved  = libxml_use_internal_errors(true);
    $result = $html->loadHtmlFile($url);
    libxml_use_internal_errors($saved);
    if (!$result) {
        throw new RuntimeException(sprintf('Failed to load HTML from "%s"', $url));
    }
    $xpath = new DOMXPath($html);
    return $xpath;
}
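For example, a failed download can now be caught explicitly instead of silently continuing with an empty document (a minimal usage sketch):

try {
    $xpath = xpath_from_url($url);
} catch (RuntimeException $e) {
    // react to the failure, e.g. log it and skip this page
    echo $e->getMessage(), "\n";
}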

Just by moving that code into the xpath_from_url function, the main processing already becomes more compact:

$xpath    = xpath_from_url($url);
$nodelist = $xpath->query($xPath);

$profileurl = array();
foreach ($nodelist as $n) {
    $value        = $n->nodeValue;
    $profileurl[] = $value;
}

But it also allows for another change: you can now process the URLs directly in the structure of your main extraction routine:

$url = "http://www.sandiego6.com/about-us/meet-our-team";

$xpath       = xpath_from_url($url);
$profileUrls = $xpath->query("//p[@class='bio']/a/@href");
foreach ($profileUrls as $profileUrl) {
    $profile = xpath_from_url($profileUrl->nodeValue);
    // ... extract the six (incl. the optional) values from a profile
}

As you can see, this code skips creating the array of profile URLs, because the collection of all profile URLs is already given by the first XPath operation.

Now the part that extracts the up to six fields from each detail page is still missing. With this new way of iterating over the profile URLs, this is pretty easy to manage: just create one XPath expression per field and fetch the data. If you use DOMXPath::evaluate instead of DOMXPath::query, you get string values directly.

The string value of a non-existing node is an empty string. That is not really a test of whether the node exists; if you need NULL instead of "" (an empty string), that has to be done differently (I can show that, too, but it's not the point right now). In the following example the anchor's name and role are extracted:

foreach ($profileUrls as $i => $profileUrl) {
    $profile = xpath_from_url($profileUrl->nodeValue);
    printf(
        "#%02d: %s (%s)\n", $i + 1,
        $profile->evaluate('normalize-space(//h1[@class="entry-title"])'),
        $profile->evaluate('normalize-space(//h2[@class="fn"])')
    );
    // ... extract the other four (incl. the optional) values from a profile
}

I chose to output the values directly (rather than collecting them into an array or a similar structure) so that it's easy to follow what happens:

#01: Marc Bailey (Morning Anchor)
#02: Heather Myers (Morning Anchor)
#03: Jim Patton (10pm Anchor)
#04: Neda Iranpour (10 p.m. Anchor / Reporter)
...

Fetching the details about email, facebook and twitter works the same:

foreach ($profileUrls as $i => $profileUrl) {
    $profile = xpath_from_url($profileUrl->nodeValue);
    printf(
        "#%02d: %s (%s)\n", $i + 1,
        $profile->evaluate('normalize-space(//h1[@class="entry-title"])'),
        $profile->evaluate('normalize-space(//h2[@class="fn"])')
    );
    printf(
        "  email...: %s\n",
        $profile->evaluate('substring-after(//*[@class="bio-email"]/a/@href, ":")')
    );
    printf(
        "  facebook: %s\n",
        $profile->evaluate('string(//*[@class="bio-facebook url"]/a/@href)')
    );
    printf(
        "  twitter.: %s\n",
        $profile->evaluate('string(//*[@class="bio-twitter url"]/a/@href)')
    );
}

This already outputs the data as you need it (I've left the images out because they can't be displayed well in text mode):

#01: Marc Bailey (Morning Anchor)
  email...: m.bailey@sandiego6.com
  facebook: https://www.facebook.com/marc.baileySD6
  twitter.: http://www.twitter.com/MarcBaileySD6
#02: Heather Myers (Morning Anchor)
  email...: heather.myers@sandiego6.com
  facebook: https://www.facebook.com/heather.myersSD6
  twitter.: http://www.twitter.com/HeatherMyersSD6
#03: Jim Patton (10pm Anchor)
  email...: jim.patton@sandiego6.com
  facebook: https://www.facebook.com/Jim.PattonSD6
  twitter.: http://www.twitter.com/JimPattonSD6
#04: Neda Iranpour (10 p.m. Anchor / Reporter)
  email...: Neda.Iranpour@sandiego6.com
  facebook: https://www.facebook.com/lightenupwithneda
  twitter.: http://www.twitter.com/@LightenUpWNeda
...

So now these few lines of code with one foreach loop already represent the original structure fairly well:

* profile pages
     `- profile page
          +- name
          +- role
          +- img
          +- email
          +- facebook
          `- twitter

All you have to do is follow that overall structure of how the data is available with your code. Then, at the end, when you see that all the data can be obtained as wished, you do the store operation in the database: one insert per profile, that is, one row per profile. You don't have to keep all the data in memory; you can just insert the data for each row (perhaps with a check for whether it already exists).
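A minimal sketch of that last step, assuming a PDO connection and a hypothetical anchors table (placeholder names, adjust to your schema); the empty strings of missing optional fields are turned into real NULLs before binding:

$pdo  = new PDO('mysql:host=localhost;dbname=scrape;charset=utf8', 'user', 'pass');
$stmt = $pdo->prepare(
    'INSERT INTO anchors (name, role, img, email, facebook, twitter)
     VALUES (?, ?, ?, ?, ?, ?)'
);

foreach ($profileUrls as $profileUrl) {
    $profile = xpath_from_url($profileUrl->nodeValue);

    $row = array(
        $profile->evaluate('normalize-space(//h1[@class="entry-title"])'),
        $profile->evaluate('normalize-space(//h2[@class="fn"])'),
        $profile->evaluate('string(//img[@class="photo fn"]/@src)'),
        $profile->evaluate('substring-after(//*[@class="bio-email"]/a/@href, ":")'),
        $profile->evaluate('string(//*[@class="bio-facebook url"]/a/@href)'),
        $profile->evaluate('string(//*[@class="bio-twitter url"]/a/@href)'),
    );

    // convert the empty strings of missing optional fields to NULL
    $row = array_map(function ($value) {
        return $value === '' ? null : $value;
    }, $row);

    // one insert per profile, i.e. one row per profile
    $stmt->execute($row);
}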

Hope that helps.


Appendix: Code in full

<?php
/**
 * Scraping detail pages based on index page
 */

/**
 * @param string $url
 *
 * @throws RuntimeException
 * @return DOMXPath
 */
function xpath_from_url($url)
{
    $html   = new DOMDocument();
    $saved  = libxml_use_internal_errors(true);
    $result = $html->loadHtmlFile($url);
    libxml_use_internal_errors($saved);
    if (!$result) {
        throw new RuntimeException(sprintf('Failed to load HTML from "%s"', $url));
    }
    $xpath = new DOMXPath($html);
    return $xpath;
}

$url = "http://www.sandiego6.com/about-us/meet-our-team";

$xpath       = xpath_from_url($url);
$profileUrls = $xpath->query("//p[@class='bio']/a/@href");
foreach ($profileUrls as $i => $profileUrl) {
    $profile = xpath_from_url($profileUrl->nodeValue);
    printf(
        "#%02d: %s (%s)\n", $i + 1, $profile->evaluate('normalize-space(//h1[@class="entry-title"])'),
        $profile->evaluate('normalize-space(//h2[@class="fn"])')
    );
    printf("  email...: %s\n", $profile->evaluate('substring-after(//*[@class="bio-email"]/a/@href, ":")'));
    printf("  facebook: %s\n", $profile->evaluate('string(//*[@class="bio-facebook url"]/a/@href)'));
    printf("  twitter.: %s\n", $profile->evaluate('string(//*[@class="bio-twitter url"]/a/@href)'));
}
