
jsoup - how to check whether a webpage exists

Hi Stack Overflow users.

While web scraping, I ran into a problem when crawling a series of pages on one site whose URLs look like this:

http://www.somewebsites.com/abc.php?number=0001
http://www.somewebsites.com/abc.php?number=0002
http://www.somewebsites.com/abc.php?number=0003
..
..
http://www.somewebsites.com/abc.php?number=1234

Some of these pages may occasionally be down, and the server may handle that by redirecting to a different page, such as the homepage. When that happens, my scraping program hits various exceptions, because the page it receives has a different structure from the one it expects.

Is there a way to check whether the page I'm scraping exists, so that my program isn't terminated in this case?

I'm using

Jsoup.connect()

to connect to the page. However, when I request one of the failed pages, I am redirected to another page. The console shows no exception from the connection itself. Instead, I get an index-out-of-bounds exception, because the unexpected redirect target has a totally different structure.

"Since some of the pages may be occasionally down and the server may handle it by redirecting to a different page, say the homepage"

In general, when a page on a website is temporarily unavailable and gets redirected, the client receives a response code such as 302 (Found) or 307 (Temporary Redirect), along with a "Location" header that points to the target page. In jsoup, you can configure the Connection not to follow redirects in such cases by calling followRedirects(false). You can then check the HTTP response code before converting the response to a Document for further processing.
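The approach described above could look roughly like the following. This is a minimal sketch, not a definitive implementation: the class name PageChecker and the helpers isRedirect, isSuccess, and fetchIfPresent are hypothetical names chosen for illustration, the URL is the placeholder from the question, and treating every non-2xx status as "page missing" is an assumption that may not suit every site. The ignoreHttpErrors(true) call is an addition so that 4xx/5xx responses are reported through the status code rather than as exceptions.

```java
import java.io.IOException;
import java.util.Optional;

import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

class PageChecker {

    /** True for any 3xx status, i.e. the server answered with a redirect. */
    static boolean isRedirect(int statusCode) {
        return statusCode >= 300 && statusCode < 400;
    }

    /** True for any 2xx status, i.e. the requested page itself was served. */
    static boolean isSuccess(int statusCode) {
        return statusCode >= 200 && statusCode < 300;
    }

    /** Returns the parsed page, or empty if the server redirected or reported an error. */
    static Optional<Document> fetchIfPresent(String url) throws IOException {
        Connection.Response response = Jsoup.connect(url)
                .followRedirects(false)  // do not silently follow a 3xx to another page
                .ignoreHttpErrors(true)  // expose 4xx/5xx via statusCode() instead of throwing
                .execute();

        int status = response.statusCode();
        if (!isSuccess(status)) {
            String target = isRedirect(status) ? " -> " + response.header("Location") : "";
            System.err.println("Skipping " + url + ": HTTP " + status + target);
            return Optional.empty();
        }
        // Only parse once we know the response is the page we asked for.
        return Optional.of(response.parse());
    }

    public static void main(String[] args) {
        for (int n = 1; n <= 3; n++) {
            // Placeholder URL pattern taken from the question.
            String url = String.format("http://www.somewebsites.com/abc.php?number=%04d", n);
            try {
                fetchIfPresent(url).ifPresent(doc -> System.out.println(doc.title()));
            } catch (IOException e) {
                // Network-level failures (DNS, timeout) are logged and the loop continues.
                System.err.println("Could not fetch " + url + ": " + e.getMessage());
            }
        }
    }
}
```

Because the status check happens before parse(), a redirected page never reaches the scraping logic, so the index-out-of-bounds failure from the unexpected page structure should not occur for those URLs.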
