How do I deal with an HTTP 410 when web scraping?
I can access a web site through my browser, for example:
https://waset.org/conferences-in-february-2020-in-london
... but if I try to web scrape this web site (I am using PHP simplehtmldom), I get an HTTP Error 410 (which means the page is gone — but it is clearly there, as I can see it through my browser).
Other web sites from the same family (e.g. https://waset.org/conferences-in-february-2021-in-london) I can scrape just fine.
Does anybody know why I get a 410 when the web page is there, and what I can do about it?
You can still crawl it. Chrome also gets the 410 error code for this page. Just continue your processing as if it were a 200 code.
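The idea can be sketched as follows: read the status line that PHP exposes (for example in `$http_response_header[0]` after a `file_get_contents` call with `ignore_errors`) and treat 410 the same as 200, since the server still sends a usable body. The helper names below are illustrative, not part of the original answer.

```php
<?php
// Extract the numeric status code from an HTTP status line,
// e.g. "HTTP/1.1 410 Gone" -> 410. Returns 0 if the line is malformed.
function status_code(string $statusLine): int {
    if (preg_match('#^HTTP/\S+\s+(\d{3})#', $statusLine, $m)) {
        return (int)$m[1];
    }
    return 0;
}

// Treat 410 the same as 200: in both cases the server returned a body
// we can hand to the HTML parser.
function is_usable(int $code): bool {
    return $code === 200 || $code === 410;
}

var_dump(status_code('HTTP/1.1 410 Gone')); // int(410)
var_dump(is_usable(410));                   // bool(true)
```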
-- Edit --
Look at this code; it works well for your page:
$context = stream_context_create(array(
    'http' => array('ignore_errors' => true), // keep the body even on 4xx/5xx responses
));
$result = file_get_contents('https://waset.org/conferences-in-february-2020-in-london', false, $context);
var_dump($result);
// output: <!DOCTYPE html> <html lang="en" dir="ltr" id="desktop"> <head> <!--Google Tag Manager -->...
We simply choose to ignore errors, just like our browser does automatically.
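Once the body is in hand, parsing proceeds as normal. Here is a minimal, self-contained sketch of that step — it uses PHP's built-in DOMDocument instead of simplehtmldom so it runs without extra libraries, and a short HTML string stands in for the real `$result`:

```php
<?php
// Stand-in for the $result returned by file_get_contents above.
$result = '<!DOCTYPE html><html lang="en"><head><title>Conferences</title></head>'
        . '<body><h1>February 2020</h1></body></html>';

$doc = new DOMDocument();
// Suppress warnings that real-world, imperfect markup would trigger.
@$doc->loadHTML($result);

// Pull something out of the page, exactly as you would after a 200.
$title = $doc->getElementsByTagName('title')->item(0)->textContent;
echo $title, "\n"; // prints "Conferences"
```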