
jsoup - how to check whether a webpage exists or not

Hi Stack Overflow users.

While scraping a series of pages on a particular site, I ran into a problem. The URLs look like this:

http://www.somewebsites.com/abc.php?number=0001
http://www.somewebsites.com/abc.php?number=0002
http://www.somewebsites.com/abc.php?number=0003
..
..
http://www.somewebsites.com/abc.php?number=1234

Something like this. Some of the pages may occasionally be down, and the server may handle this by redirecting to a different page, say the homepage. When that happens, my scraping program hits various exceptions because the page it receives has a different structure (it is a different page).

I'm wondering whether there is a way to check that a page I'm scraping actually exists, so that my program is not terminated in this case.

I'm using

Jsoup.connect()

to connect to that page. However, when I visit one of the failed pages, I am redirected to another page. The console does not show any exception about the connection. Instead, the only exception is an index-out-of-bounds exception, because the unexpected redirected page has a totally different structure.

"Since some of the pages may be occasionally down and the server may handle it by redirecting to a different page, say the homepage"

In general, when a page on a website is temporarily unavailable and gets redirected, the client receives response code 302 (Found, historically "Moved Temporarily") or 307 (Temporary Redirect), along with a "Location" header that points to the page it is being redirected to. (A permanent move would be 301.) You can configure the Connection not to follow redirects in such cases by setting followRedirects to false. Then you can check the HTTP response code before converting the response to a Document for further processing.
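As a rough sketch of that idea (not a drop-in solution), something like the following could work, assuming jsoup is on the classpath. The class name PageChecker and the helpers isRedirect and fetchIfPresent are made up for illustration, and the URLs are the placeholders from the question. Treating only HTTP 200 as "the page exists" is also an assumption; adjust it to whatever the site actually returns.

```java
import java.io.IOException;

import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

class PageChecker {

    // Hypothetical helper: true for the 3xx statuses that send the client elsewhere.
    static boolean isRedirect(int statusCode) {
        return statusCode == 301 || statusCode == 302 || statusCode == 303
                || statusCode == 307 || statusCode == 308;
    }

    // Hypothetical helper: returns the parsed page, or null when the server
    // redirected or answered with anything other than 200.
    static Document fetchIfPresent(String url) throws IOException {
        Connection.Response response = Jsoup.connect(url)
                .followRedirects(false)  // do not silently follow the redirect to the homepage
                .ignoreHttpErrors(true)  // report 4xx/5xx via statusCode() instead of throwing
                .execute();

        int status = response.statusCode();
        if (isRedirect(status)) {
            System.out.println(url + " redirects to " + response.header("Location"));
            return null;
        }
        if (status != 200) {
            System.out.println(url + " returned HTTP " + status);
            return null;
        }
        return response.parse();  // only parse pages that were served directly
    }

    public static void main(String[] args) {
        // Placeholder URL pattern taken from the question.
        for (int i = 1; i <= 3; i++) {
            String url = String.format("http://www.somewebsites.com/abc.php?number=%04d", i);
            try {
                Document doc = fetchIfPresent(url);
                if (doc != null) {
                    System.out.println(url + " -> " + doc.title());
                }
            } catch (IOException e) {
                // Network failures (unknown host, timeout) skip this page instead of ending the run.
                System.out.println(url + " failed: " + e.getMessage());
            }
        }
    }
}
```

If the site serves the fallback page with a plain 200 and no redirect, the status code alone will not catch it. In that case you would still need to check that the element you expect is present before indexing into the selection.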


 