
Jsoup connect(): bypass google captcha

I am building a small application that has to retrieve URLs from Google search results based on keywords. This is the code:

    Elements doc = Jsoup
            .connect(request)
            .userAgent("Mozilla 5.0 (Windows NT 6.1)")
            .timeout(5000)
            .get()
            .select("li.g>h3>a");

    for (Element link : doc) {
        String url = link.absUrl("href");
        try {
            url = URLDecoder.decode(url.substring(url.indexOf('=') + 1, url.indexOf('&')), "UTF-8");
        } catch (UnsupportedEncodingException e) {
            e.printStackTrace();
        }

        if (!url.startsWith("http"))
            continue; // Ads/news/etc.
        else if (url.contains("/pdf/"))
            continue;
        else if (url.contains("//github.com/"))
            continue;

        res.add(url);
    }

But all I get is the following error:

org.jsoup.HttpStatusException: HTTP error fetching URL. Status=503, URL=http://ipv4.google.com/sorry/IndexRedirect?continue=http://www.google.com/search%3Flr%3Dlang_en....
at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:435)
at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:446)
at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:410)
at org.jsoup.helper.HttpConnection.execute(HttpConnection.java:164)
at org.jsoup.helper.HttpConnection.get(HttpConnection.java:153)
at sperimentazioni.Main.getDataFromGoogle(Main.java:327)
at sperimentazioni.Main.getURLs(Main.java:164)
at sperimentazioni.Main.main(Main.java:485)

Apparently this is Google's CAPTCHA. How can I bypass it?

The following logic works for me:

Document doc =
    Jsoup.connect(request)
         .userAgent("Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)")
         .timeout(5000).get();

Elements links = doc.select("a[href]");
for (Element link : links) {

    String temp = link.attr("href");
    if (temp.startsWith("/url?q=")) 
        result.add(temp);

}
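
The hrefs collected this way are still Google redirect paths of the form `/url?q=<target>&...`, so the real address has to be cut out and decoded before use. Below is a minimal sketch using only the standard library. `RedirectLinks` and `extractTarget` are hypothetical names, and the assumption that the target ends at the first `&` simply mirrors the decoding logic in the question.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

// Hypothetical helper for illustration; not part of jsoup or of the answer above.
final class RedirectLinks {
    private static final String PREFIX = "/url?q=";

    /**
     * Returns the decoded target of a "/url?q=<target>&..." href,
     * or null when the href does not have that shape.
     */
    static String extractTarget(String href) {
        if (href == null || !href.startsWith(PREFIX)) {
            return null;
        }
        // Assumption: the target runs up to the first '&' (or to the end if there is none).
        int end = href.indexOf('&', PREFIX.length());
        String encoded = end < 0
                ? href.substring(PREFIX.length())
                : href.substring(PREFIX.length(), end);
        try {
            return URLDecoder.decode(encoded, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available on a compliant JVM.
            throw new IllegalStateException(e);
        }
    }
}
```

Inside the loop, `result.add(RedirectLinks.extractTarget(temp))` would then store the decoded address instead of the raw redirect path.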

You cannot bypass it. However, you can use a third-party service for CAPTCHA recognition and post the correct answer. See DeathByCaptcha.com.
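
Whichever route is taken, the caller first has to recognize the block instead of treating it as a generic failure. The sketch below encodes only what the stack trace in the question shows: status 503 and a URL whose path starts with `/sorry/`. `CaptchaBlock` and `isSorryRedirect` are hypothetical names. With jsoup, the two arguments could come from `HttpStatusException.getStatusCode()` and `getUrl()`.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical helper for illustration; the "/sorry/" path is taken from the trace in the question.
final class CaptchaBlock {

    /** True when the response looks like the CAPTCHA interstitial seen in the trace. */
    static boolean isSorryRedirect(int status, String url) {
        if (status != 503 || url == null) {
            return false;
        }
        try {
            String path = new URI(url).getPath();
            return path != null && path.startsWith("/sorry/");
        } catch (URISyntaxException e) {
            // An unparsable URL is treated as "not the interstitial".
            return false;
        }
    }
}
```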

The only way to solve this is to show the CAPTCHA to the user and then save the cookies from the response headers.
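
A rough sketch of the cookie-saving step, assuming the raw `Set-Cookie` header values are already available as strings. `SavedCookies` and its methods are hypothetical names, and the cookie names in the usage note are placeholders, not the ones Google actually sends. Only the leading `name=value` pair of each header is kept; attributes such as `Path` or `Expires` are ignored.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration; keeps only the name=value part of each Set-Cookie header.
final class SavedCookies {

    /** Collects cookie names and values from raw Set-Cookie header values. */
    static Map<String, String> fromSetCookieHeaders(List<String> headers) {
        Map<String, String> cookies = new LinkedHashMap<>();
        for (String header : headers) {
            int semi = header.indexOf(';');
            String pair = semi < 0 ? header : header.substring(0, semi);
            int eq = pair.indexOf('=');
            if (eq <= 0) {
                continue; // no cookie name, skip this header
            }
            cookies.put(pair.substring(0, eq).trim(), pair.substring(eq + 1).trim());
        }
        return cookies;
    }

    /** Builds the value of a Cookie request header from the saved cookies. */
    static String toCookieHeader(Map<String, String> cookies) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> e : cookies.entrySet()) {
            if (sb.length() > 0) {
                sb.append("; ");
            }
            sb.append(e.getKey()).append('=').append(e.getValue());
        }
        return sb.toString();
    }
}
```

For example, the headers `session=abc123; Path=/` and `token=xyz; HttpOnly` yield a map with the keys `session` and `token`. With jsoup, such a map can be passed to later requests through `Connection.cookies(Map)`.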
