
Should I use proxies with simplexml_load_file and file_get_contents?

I've been using simplexml_load_file for a while to fetch RSS feeds from several websites.

Sometimes I get errors from some of these websites, and for about 5 days I've been getting errors from 2 specific websites.

Here are the errors from simplexml_load_file:

PHP Warning:  simplexml_load_file(http://example.com/feed): failed to open stream: Connection timed out 

PHP Warning:  simplexml_load_file(): I/O warning : failed to load external entity "http://example.com/feed" 

Here are the errors from file_get_contents:

PHP Warning:  file_get_contents(http://example.com/page): failed to open stream: HTTP request failed! HTTP/1.1 403 Forbidden

That's how I'm calling simplexml_load_file:

simplexml_load_file( $url );

That's how I'm calling file_get_contents:

file_get_contents( $url );

Is that because I'm not using a proxy, or because I'm passing invalid arguments?

UPDATE: The 2 websites are using something like a firewall or an anti-bot service that checks for robots:

Accessing http://example.com/feed securely…
This is an automatic process. Your browser will redirect to your requested content in 5 seconds.

You're relying on an assumption that http://example.com/feed is always going to exist and always return exactly the content you're looking for. As you've discovered, this is a bad assumption.

You're attempting to access the network with your file_get_contents() and simplexml_load_file() calls and finding out that sometimes those calls fail. You must always plan for these calls to fail. It doesn't matter if some websites openly allow this kind of behavior or if you have a very reliable web host. There are circumstances outside your control, such as an Internet backbone outage, that will eventually cause your application to get back a bad response. In your situation, the third party has blocked you. This is one of the failures that happen with network requests.
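As a minimal sketch of "plan for these calls to fail": both functions return false on failure, so check the result before using it. Here $url stands in for the feed URL from the question.

```php
<?php
// Collect XML parse errors instead of letting libxml emit PHP warnings.
libxml_use_internal_errors(true);

$body = file_get_contents($url); // $url: hypothetical feed URL
if ($body === false) {
    // Network failure: log it, retry later, or fall back to cached content.
    error_log("Could not fetch $url");
} else {
    $xml = simplexml_load_string($body);
    if ($xml === false) {
        // The response was not valid XML (e.g. an HTML block page).
        foreach (libxml_get_errors() as $error) {
            error_log(trim($error->message));
        }
        libxml_clear_errors();
    }
}
```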

The first takeaway is that you must handle failures better. You cannot do this well with file_get_contents(), because file_get_contents() was designed to read the contents of local files. In my opinion, the PHP implementers made a serious mistake by allowing it to make network calls at all. I'd recommend using curl:

function doRequest($url) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1); // return the body instead of printing it
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);       // fail fast instead of hanging
    $output   = curl_exec($ch);
    $httpcode = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    curl_close($ch);

    if ($output !== false && $httpcode >= 200 && $httpcode < 300) {
        return $output;
    } else {
        throw new Exception('Sorry, an error occurred');
    }
}

Using this, you will be able to handle errors (and they will happen) more gracefully for your own users.
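For example, hypothetical calling code might catch the exception and fall back to something friendlier than a PHP warning (the feed URL and channel/item structure are assumed):

```php
<?php
try {
    $body = doRequest('http://example.com/feed'); // URL from the question
    $xml  = simplexml_load_string($body);
    if ($xml === false) {
        throw new Exception('Response was not valid XML');
    }
    foreach ($xml->channel->item as $item) {
        echo $item->title, "\n";
    }
} catch (Exception $e) {
    // Show cached content or a friendly message instead of a raw warning.
    error_log($e->getMessage());
}
```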

Your second problem is that this specific host is giving you a 403 error. This is probably intentional on their end. I would assume this is them telling you that they don't want you using their website like this. You will need to engage them directly and ask what you can do. They might point you to a real API, they might ignore you entirely, they might even tell you to pound sand - but there isn't anything we can advise here. This is strictly a problem (or feature) of their software, and you must contact them directly for guidance.

You could potentially use multiple IP addresses to connect to websites and rotate IPs each time one gets blocked. But doing so would be considered a malicious attack on their service.
