How to get specific content from cross-domain http request

Question

There is a Dutch news website at nu.nl. I am very interested in getting the URL of the first headline, which is located here:

<h3 class="hdtitle">
          <a style="" onclick="NU.AT.internalLink(this, event);" xtclib="position1_article_1" href="/buitenland/2880252/griekse-hotels-ontruimd-bosbranden.html">
            Griekse hotels ontruimd om bosbranden            <img src="/images/i18n/nl/slideshow/bt_fotograaf.png" class="vidlinkicon" alt="">          </a>
        </h3>

So my question is: how do I get this URL? Can I do this with jQuery? I would think not, because the page is not on my server. So maybe I would have to use PHP? Where do I start?

Answer 1

Tested and working

Because http://www.nu.nl is not your site, the browser will block a direct cross-domain GET from your page. You can work around this with the PHP proxy method; otherwise you will get this kind of error:

XMLHttpRequest cannot load http://www.nu.nl/ . Origin http://yourdomain.com is not allowed by Access-Control-Allow-Origin.
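To see why the browser raises this error, it may help to compare origins directly. The following is a minimal sketch, not part of the original answer: `isSameOrigin` is a hypothetical helper name, and `http://yourdomain.com` is the placeholder from the error message. It relies on the standard `URL` object found in current browsers and Node.js.

```javascript
// Returns true when two URLs share scheme, host, and port (the same origin).
// isSameOrigin is a hypothetical helper, used here only for illustration.
function isSameOrigin(pageUrl, targetUrl) {
  return new URL(pageUrl).origin === new URL(targetUrl).origin;
}

// A page on yourdomain.com requesting nu.nl directly: different origins.
console.log(isSameOrigin('http://yourdomain.com/index.html', 'http://www.nu.nl/'));

// The same page requesting its own proxy.php: same origin, so no CORS check applies.
console.log(isSameOrigin('http://yourdomain.com/index.html', 'http://yourdomain.com/proxy.php'));
```

The first call logs false and the second logs true, which is why routing the request through a script on your own server avoids the error.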

First, put this file on your server:

proxy.php (Updated)

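<?php
if (isset($_GET['site'])) {
  $site = $_GET['site'];
  // Only allow http(s) URLs; otherwise fopen() could be pointed at local files.
  if (!preg_match('#^https?://#i', $site)) {
    http_response_code(400);
    exit('Invalid URL');
  }
  $f = fopen($site, 'r');
  if ($f === false) {
    http_response_code(502);
    exit('Could not fetch the remote page');
  }
  $html = '';
  while (!feof($f)) {
    $html .= fread($f, 24000);
  }
  fclose($f);
  echo $html;
}
?>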

Now, on the JavaScript side, you can do the following with jQuery:

(Note: I use prop() because I am on jQuery 1.7.2. If you are using a version before 1.6, use attr() instead.)

$(function(){

   var site = 'http://www.nu.nl';

   $.get('proxy.php', { site:site }, function(data){

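      // prop('href') returns an absolute URL resolved against YOUR domain
      var href = $(data).find('.hdtitle').first().children(':first-child').prop('href');
      // url[2] is the host part of that URL; swap it for the real site
      var url = href.split('/');
      href = href.replace(url[2], 'nu.nl');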

      // Put the 'href' inside your div as a link
      $('#myDiv').html('<a href="' + href + '" target="_blank">' + href + '</a>');

   }, 'html');

});

As you can see, the request now goes to your own domain, so you won't get the Access-Control-Allow-Origin error again!
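As an alternative to reading prop('href') and swapping the host afterwards, the raw attribute value can be resolved against the scraped site's base URL. This is only a sketch under assumptions: `resolveHeadlineUrl` is a hypothetical helper, the raw value would come from attr('href'), and the standard `URL` constructor is available (it is in current browsers and Node.js, but not in browsers from the jQuery 1.7 era).

```javascript
// Resolve a raw href (as returned by .attr('href')) against the scraped site's base URL.
// resolveHeadlineUrl is a hypothetical helper name, not part of jQuery.
function resolveHeadlineUrl(rawHref, siteBase) {
  return new URL(rawHref, siteBase).href;
}

// The relative link from the question's HTML snippet.
const rawHref = '/buitenland/2880252/griekse-hotels-ontruimd-bosbranden.html';
console.log(resolveHeadlineUrl(rawHref, 'http://www.nu.nl'));
// http://www.nu.nl/buitenland/2880252/griekse-hotels-ontruimd-bosbranden.html
```

Resolving against an explicit base keeps the result independent of the domain your own page is served from.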

Update

If you want to get the href of every headline, as you wrote in the comments, change the jQuery code like this:

$(function(){

   var site = 'http://www.nu.nl';

   $.get('proxy.php', { site:site }, function(data){

        // get all html headlines
        var headlines = $(data).find('.hdtitle');

        // get 'href' attribute of each headline and put it inside div
        // (.each() is the right tool here; .map() is for building a new collection)
        headlines.each(function(){
            var href = $(this).children(':first-child').prop('href');
            var url = href.split('/');
            href = href.replace(url[2], 'nu.nl');
            $('#myDiv').append('<a href="' + href + '" target="_blank">' + href + '</a><br/>');
        });

   }, 'html');

});

Use the updated proxy.php file in both cases (one headline or all of them).

Hope this helps :-)

Answer 2

You can use the simplehtmldom library to get that link.

Something like this:

include 'simple_html_dom.php';
$html = file_get_html('website_link');
// hdtitle is a class, not an id, so select by class and take the first match
echo $html->find('h3.hdtitle a', 0)->href;

read more here

Answer 3

I would have suggested RSS, but unfortunately the headline you're looking for doesn't seem to appear there.

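<?php

$f = fopen('http://www.nu.nl', 'r');
$html = '';
// Read until the marker shows up; stop at end of stream to avoid looping forever.
while (!feof($f) && strpos($html, 'position1_article_1') === FALSE)
    $html .= fread($f, 24000);
fclose($f);
$pos = strpos($html, 'position1_article_1');
// Skip the 19-character marker plus the 8 characters of `" href="`.
$urlleft = substr($html, $pos + 27);
$url = substr($urlleft, 0, strpos($urlleft, '"'));
echo 'http://www.nu.nl' . $url;

?>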

Outputs: http://www.nu.nl/buitenland/2880252/griekse-hotels-ontruimd-bosbranden.html
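The fixed offset of 27 only works while the marker is followed by exactly `" href="`. As a rough illustration of the same marker-based idea without the hard-coded offset, here is a JavaScript sketch (JavaScript is used so it can run as-is in Node.js; `hrefAfterMarker` is a hypothetical helper name, and the sample string is the anchor tag from the question).

```javascript
// Find the first href="..." value that follows a marker string in raw HTML.
// hrefAfterMarker is a hypothetical helper; it returns null when nothing is found.
function hrefAfterMarker(html, marker) {
  const markerPos = html.indexOf(marker);
  if (markerPos === -1) return null;

  const attr = 'href="';
  const attrPos = html.indexOf(attr, markerPos);
  if (attrPos === -1) return null;

  const valueStart = attrPos + attr.length;
  const valueEnd = html.indexOf('"', valueStart);
  return valueEnd === -1 ? null : html.slice(valueStart, valueEnd);
}

const sample = '<a style="" onclick="NU.AT.internalLink(this, event);" ' +
  'xtclib="position1_article_1" ' +
  'href="/buitenland/2880252/griekse-hotels-ontruimd-bosbranden.html">';

console.log('http://www.nu.nl' + hrefAfterMarker(sample, 'position1_article_1'));
```

Like the PHP version, this still depends on the href attribute appearing after the marker in the page source.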

Answer 4

If you want to set up a jQuery bot to scrape the page through a browser (Google Chrome extensions allow for this functionality):

// print out the found anchor link's href attribute
console.log($('.hdtitle').find('a').attr('href'));

If you want to use PHP, you'll need to scrape the page for this href link. Libraries such as SimpleTest can do this. To scrape periodically, run your PHP script from a cron job.

SimpleTest : http://www.lastcraft.com/browser_documentation.php

cronjob : http://net.tutsplus.com/tutorials/php/managing-cron-jobs-with-php-2/

Good luck!

Answer 5

Use cURL to retrieve the page. Then use the following call to parse the HTML string you've provided:

preg_match("/<a.*?href\=\"(.*?)\".*?>/is",$text,$matches);

The resulting URL will be in $matches[1]; $matches[0] holds the whole matched tag.
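For comparison, the same capture-group idea can be tried out in JavaScript. This is a sketch only: `firstHref` is a hypothetical helper, the pattern is a slightly stricter variant of the one above (it assumes a double-quoted href), and the sample string is the anchor tag from the question.

```javascript
// Return the href of the first <a> tag in an HTML string, or null if there is none.
// firstHref is a hypothetical helper; it assumes the href value is double-quoted.
function firstHref(html) {
  const match = /<a\b[^>]*?\bhref="([^"]*)"/i.exec(html);
  return match ? match[1] : null; // index 1 is the first capture group
}

const text = '<h3 class="hdtitle"><a style="" onclick="NU.AT.internalLink(this, event);" ' +
  'xtclib="position1_article_1" ' +
  'href="/buitenland/2880252/griekse-hotels-ontruimd-bosbranden.html">' +
  'Griekse hotels ontruimd om bosbranden</a></h3>';

console.log(firstHref(text));
```

Regular expressions are fragile against markup changes, so this approach suits a small, known snippet better than a whole page.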

5 answers

solution1 3 ACCEPTED 2012-08-09 16:24:07
solution2 1 2012-08-09 15:16:38
solution3 1 2012-08-09 15:18:29
solution4 0 2012-08-09 15:07:50
solution5 0 2012-08-09 15:14:23
