How to handle HTTP 410 when web scraping?

Question

I can access a web site through my browser, for example:

https://waset.org/conferences-in-february-2020-in-london

... but if I try to scrape this page (I am using PHP simplehtmldom), I get an HTTP Error 410. That status means the page is gone, yet it is clearly there, because I can see it in my browser.

Other pages from the same family (e.g. https://waset.org/conferences-in-february-2021-in-london) I can scrape just fine.

Does anybody know why I get a 410 when the web page is there, and what I can do about it?

Answer 1

You can still crawl it. Chrome also receives a 410 status code for this page.

Carry on as if the response were a 200.

-- Edit --

Look at this code; it works well for your page:

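// Tell the HTTP stream wrapper to return the body even for 4xx/5xx responses.
$context = stream_context_create(array(
    'http' => array('ignore_errors' => true),
));

// Without this context, file_get_contents() would emit a warning and return false on a 410.
$result = file_get_contents('https://waset.org/conferences-in-february-2020-in-london', false, $context);

var_dump($result);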
// output <!DOCTYPE html> <html lang="en" dir="ltr" id="desktop"> <head> <!--Google Tag Manager -->...

We simply tell PHP to ignore the error status and hand back the body anyway, which is what a browser does automatically.
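If the scraper should still know that the server answered 410, the status line can be read back from the response headers. The following is a minimal sketch, not the answerer's code: `statusFromHeaders` and `fetchWithStatus` are hypothetical helper names, and it assumes the headers are available through PHP's `$http_response_header` variable after the `file_get_contents()` call.

```php
<?php
// Hypothetical helper: extract the final HTTP status code from the header
// lines PHP collects for an http:// or https:// request.
function statusFromHeaders(array $headers): ?int
{
    $status = null;
    foreach ($headers as $line) {
        // Status lines look like "HTTP/1.1 410 Gone". Redirects produce several
        // of them, so keep overwriting and end up with the last one.
        if (preg_match('#^HTTP/\d+(?:\.\d+)?\s+(\d{3})#', $line, $m)) {
            $status = (int) $m[1];
        }
    }
    return $status;
}

// Hypothetical helper: fetch a page and return both its status and its body.
function fetchWithStatus(string $url): array
{
    $context = stream_context_create([
        'http' => ['ignore_errors' => true], // return the body even on 4xx/5xx
    ]);
    $body = file_get_contents($url, false, $context);
    // PHP fills $http_response_header in the local scope after the request.
    $headers = $http_response_header ?? [];
    return ['status' => statusFromHeaders($headers), 'body' => $body];
}
```

A call such as `fetchWithStatus('https://waset.org/conferences-in-february-2020-in-london')` would then give the caller both pieces, so the body can be passed on to the HTML parser (for example simplehtmldom's `str_get_html()`) while the 410 is logged rather than treated as fatal.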

Answer 2

When you load the page in a browser, the server responds with a 410 as well (see attached image). They probably want to convey that the conferences have expired.

The rest of the data is loaded as expected...
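One way to act on this is to treat a 410 that still carries a body as "expired but parseable" rather than as a failure. The sketch below uses PHP's cURL extension instead of `file_get_contents()`; that is a swapped-in technique, not something either answer used. `classifyResponse` and `fetchWithCurl` are hypothetical names, and the three labels are an assumed convention for illustration.

```php
<?php
// Hypothetical helper: decide how a scraper should treat a response.
function classifyResponse(int $status, string $body): string
{
    if ($status >= 200 && $status < 300) {
        return 'ok';
    }
    // 410 Gone with a non-empty body: assume the page is expired but still parseable.
    if ($status === 410 && trim($body) !== '') {
        return 'expired';
    }
    return 'error';
}

// Hypothetical helper: fetch a page with cURL and classify the result.
function fetchWithCurl(string $url): array
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body as a string
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow redirects
    // CURLOPT_FAILONERROR stays at its default (false), so 4xx bodies are returned.
    $body = curl_exec($ch);
    $status = (int) curl_getinfo($ch, CURLINFO_RESPONSE_CODE);
    $body = is_string($body) ? $body : ''; // curl_exec() returns false on transport errors
    return ['status' => $status, 'body' => $body, 'kind' => classifyResponse($status, $body)];
}
```

With this split, a scraper could parse pages labelled `ok` or `expired` and skip only those labelled `error`.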

Answer 1: score 1, accepted, 2020-02-16 14:28:42

Answer 2: score 0, 2020-02-16 14:30:20