
JSOUP / HTTP error fetching URL. Status=503

I am using Jsoup to scrape the whole web page as follows:

    public static final String GOOGLE_SEARCH_URL = "https://www.google.com/search";

    String searchURL = GOOGLE_SEARCH_URL + "?q=" + searchTerm + "&num=" + num +
            "&start=" + start;

    Document doc = Jsoup.connect(searchURL)
            .userAgent("Mozilla/5.0 Chrome/26.0.1410.64 Safari/537.31")
            // .ignoreHttpErrors(true)
            .maxBodySize(1024 * 1024 * 3)
            .followRedirects(true)
            .timeout(100000)
            .ignoreContentType(true)
            .get();

    Elements results = doc.select("h3.r > a");

    for (Element result : results) {
        String linkHref = result.attr("href");
    }

My problem is that the code works fine at first, but after a while it stops and always gives me an "HTTP error fetching URL. Status=503" error.

When I add .ignoreHttpErrors(true), it runs without any error, but it does not scrape the web page.

*searchTerm is any keyword I want to search for, and num is the number of pages I need to retrieve.

Could anyone help, please? Does this mean that Google has blocked my IP from scraping? If so, is there any solution, or how can I scrape the Google search results?

I need help. Thank you.

A 503 error usually means the website you are trying to scrape is blocking you because they don't want non-human users navigating their site. This is especially true of Google.

There are a few things you can do, though, such as:

  • Using a proxy rotator
  • Using chromedriver
  • Adding a delay to your application after each page

Basically, you need to appear as human as possible to prevent sites from blocking you.

EDIT:

I need to warn you that scraping Google search results is against their ToS and might be illegal, depending on where you are.

What you can do

You can use a proxy rotating service to mask your requests so that Google sees them as requests coming from multiple regions. Google "proxy rotator service" if you are interested. It might be expensive, depending on what you do with the data.
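For example, with Jsoup you can send each request through a different proxy picked from a pool. This is only a sketch; the proxy addresses below are placeholders for whatever endpoints your rotation service gives you:

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;

    import java.util.List;
    import java.util.concurrent.ThreadLocalRandom;

    public class ProxyRotationSketch {

        // Placeholder proxies - replace with the endpoints from your rotation service.
        private static final List<String[]> PROXIES = List.of(
                new String[]{"203.0.113.10", "8080"},
                new String[]{"203.0.113.11", "8080"},
                new String[]{"203.0.113.12", "3128"});

        public static Document fetchThroughProxy(String url) throws Exception {
            // Pick a random proxy from the pool for this request.
            String[] proxy = PROXIES.get(ThreadLocalRandom.current().nextInt(PROXIES.size()));

            return Jsoup.connect(url)
                    .proxy(proxy[0], Integer.parseInt(proxy[1])) // route the request via the chosen proxy
                    .userAgent("Mozilla/5.0 Chrome/26.0.1410.64 Safari/537.31")
                    .timeout(100000)
                    .get();
        }
    }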

Then write a small module that changes the User-Agent on every request to make Google less suspicious of your traffic.
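A minimal sketch of that, assuming you keep a list of browser signatures to choose from (the strings below are just examples):

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;

    import java.util.List;
    import java.util.concurrent.ThreadLocalRandom;

    public class UserAgentRotator {

        // Example desktop browser signatures; extend or replace this list as you like.
        private static final List<String> USER_AGENTS = List.of(
                "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36",
                "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_2) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.0 Safari/605.1.15",
                "Mozilla/5.0 (X11; Linux x86_64; rv:72.0) Gecko/20100101 Firefox/72.0");

        public static Document fetch(String url) throws Exception {
            // Use a different, randomly chosen User-Agent for every request.
            String agent = USER_AGENTS.get(ThreadLocalRandom.current().nextInt(USER_AGENTS.size()));
            return Jsoup.connect(url)
                    .userAgent(agent)
                    .timeout(100000)
                    .get();
        }
    }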

Add a random delay after scraping each page; I suggest around 1-5 seconds. A randomized delay makes your requests look more human to Google.
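Something like this between page fetches is enough (the 1-5 second range mirrors the suggestion above):

    import java.util.concurrent.ThreadLocalRandom;

    public class HumanPause {

        // Sleep for a random 1-5 seconds so consecutive requests are not machine-timed.
        public static void pause() throws InterruptedException {
            Thread.sleep(ThreadLocalRandom.current().nextLong(1000, 5001));
        }
    }

Call HumanPause.pause() after each .get() in your paging loop.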

Lastly, if everything fails, you might want to look into the Google Search API and use it instead of scraping their site.
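For instance, Google's Custom Search JSON API returns results over plain HTTPS. A rough sketch, assuming you have created an API key and a Programmable Search Engine id (both shown as placeholders here):

    import java.net.URI;
    import java.net.URLEncoder;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;
    import java.nio.charset.StandardCharsets;

    public class CustomSearchSketch {

        // Placeholders: create these in the Google Cloud / Programmable Search consoles.
        private static final String API_KEY = "YOUR_API_KEY";
        private static final String SEARCH_ENGINE_ID = "YOUR_CX_ID";

        public static String search(String searchTerm) throws Exception {
            String url = "https://www.googleapis.com/customsearch/v1"
                    + "?key=" + API_KEY
                    + "&cx=" + SEARCH_ENGINE_ID
                    + "&q=" + URLEncoder.encode(searchTerm, StandardCharsets.UTF_8);

            HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
            HttpResponse<String> response = HttpClient.newHttpClient()
                    .send(request, HttpResponse.BodyHandlers.ofString());

            // The body is JSON; parse it with the JSON library of your choice.
            return response.body();
        }
    }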
