Jsoup Google Search Results

I am attempting to parse the HTML of Google's search results to grab the title of each result. This is done on Android in a private nested class, shown below:

private class WebScraper extends AsyncTask<String, Void, String> {

    public WebScraper() {}

    @Override
    protected String doInBackground(String... urls) {
        Document doc;
        try {
            doc = Jsoup.connect(urls[0]).get();
        } catch (IOException e) {
            System.out.println("Failed to open document");
            return "";
        }
        Elements results = doc.getElementsByClass("rc");
        int count = 0;
        for (Element lmnt : results) {
            System.out.println(count++);
            System.out.println(lmnt.text());
        }
        System.out.println("Count is : " + count);
        String key = "test";
        //noinspection Since15
        SearchActivity.this.songs.put(key, SearchActivity.this.songs.getOrDefault(key, 0) + 1);
        // return requested
        return "";
    }

}

An example URL I am trying to parse: http://www.google.com/#q=i+might+site:genius.com

For some reason, when I run the above code, the count is printed as 0, so no elements are being stored in results. Any help is much appreciated! PS: doc is definitely initialized and the HTML page is loading properly.

This code searches Google for a word such as "Apple", fetches all the links from the results, and prints each link's title and URL. It can run up to 500 searches a day; after that, Google detects it and stops returning results.

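    // Requires: org.jsoup.Jsoup, org.jsoup.nodes.Element, org.jsoup.select.Elements,
    // java.net.URLEncoder, java.net.URLDecoder, java.io.IOException,
    // java.io.UnsupportedEncodingException
    String google = "http://www.google.com/search?q="; // assumed base URL; not shown in the original snippet
    String charset = "UTF-8";                          // assumed charset; not shown in the original snippet
    String search = "Apple"; // the word to search for on Google
    String userAgent = "ExampleBot 1.0 (+http://example.com/bot)";
    Elements links = new Elements(); // stays empty if the request fails
    try {
        links = Jsoup.connect(google + URLEncoder.encode(search, charset))
                .userAgent(userAgent)
                .get()
                .select(".g>.r>a");
    } catch (IOException e) {
        // UnsupportedEncodingException is a subclass of IOException,
        // so this single catch covers both failure cases.
        e.printStackTrace();
    }
    for (Element link : links) {
        String title = link.text();
        // Google returns URLs in the format
        // "http://www.google.com/url?q=<url>&sa=U&ei=<someKey>".
        String url = link.absUrl("href");
        int start = url.indexOf('=') + 1;
        int end = url.indexOf('&');
        if (end < start) {
            continue; // not in the expected redirect format
        }
        try {
            url = URLDecoder.decode(url.substring(start, end), "UTF-8");
        } catch (UnsupportedEncodingException e) {
            e.printStackTrace();
        }

        if (!url.startsWith("http")) {
            continue; // Ads/news/etc.
        }

        System.out.println("Title: " + title);
        System.out.println("URL: " + url);
    }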
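
The URL-unwrapping step above takes whatever sits between the first '=' and the first '&', which only works when q happens to be the first parameter. As a minimal sketch of a more defensive variant, the helper below looks the q parameter up by name. The class and method names (RedirectUrlParser, extractTarget) are hypothetical and not part of the original answer, and the redirect format is the one described in the comment above, which Google may change.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

class RedirectUrlParser {

    // Returns the decoded value of the "q" query parameter,
    // or null if the URL has no query string or no "q" parameter.
    static String extractTarget(String href) {
        int queryStart = href.indexOf('?');
        if (queryStart < 0) {
            return null;
        }
        String query = href.substring(queryStart + 1);
        int fragmentStart = query.indexOf('#');
        if (fragmentStart >= 0) {
            query = query.substring(0, fragmentStart); // drop any fragment
        }
        for (String pair : query.split("&")) {
            if (pair.startsWith("q=")) {
                try {
                    return URLDecoder.decode(pair.substring(2), "UTF-8");
                } catch (UnsupportedEncodingException e) {
                    // UTF-8 is always available on the JVM, so this should not happen.
                    throw new IllegalStateException(e);
                }
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // Illustrative input in the redirect format described above.
        System.out.println(extractTarget(
                "http://www.google.com/url?sa=U&q=http://example.com/&ei=x"));
        // prints "http://example.com/"
    }
}
```

In the loop above, a null return would be treated the same way as a URL that does not start with "http": skip the link and continue.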

If you check the source code of Google's page, you will notice that it does not contain any of the text that is normally shown in the browser; there is only a bunch of JavaScript code. That means Google outputs all the search results dynamically.

Jsoup will fetch that JavaScript code and will not find any HTML elements with the "rc" class; that is why you get a count of zero in your code sample.
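
A related detail, offered as a likely contributing factor rather than a confirmed diagnosis: in the example URL from the question, the search terms sit after '#', which makes them a URL fragment. Fragments are handled by the browser and are not sent in the HTTP request, so a client such as Jsoup ends up requesting only "/". The sketch below uses java.net.URI to show how the two URL shapes are split; the class name FragmentCheck is hypothetical, and it does not verify what markup Google returns for either request.

```java
import java.net.URI;

class FragmentCheck {

    // Summarizes which part of the URL carries the search terms.
    static String describe(String url) {
        URI uri = URI.create(url);
        return "path=" + uri.getRawPath()
                + " query=" + uri.getRawQuery()
                + " fragment=" + uri.getRawFragment();
    }

    public static void main(String[] args) {
        // URL from the question: the terms are in the fragment, which stays on the client.
        System.out.println(describe("http://www.google.com/#q=i+might+site:genius.com"));
        // prints "path=/ query=null fragment=q=i+might+site:genius.com"

        // Same terms as a query string, which is sent to the server.
        System.out.println(describe("http://www.google.com/search?q=i+might+site:genius.com"));
        // prints "path=/search query=q=i+might+site:genius.com fragment=null"
    }
}
```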

Consider using Google's public search API instead of parsing its HTML pages directly: https://developers.google.com/custom-search/ .
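
As a rough sketch of what a request to that API looks like, the helper below builds a Custom Search JSON API URL from an API key, a search engine ID (the cx parameter), and a query. The class and method names, and the placeholder key and engine ID, are hypothetical; you would need to create your own credentials. The response is JSON, and fetching and parsing it are left out here.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

class CustomSearchRequest {

    static final String ENDPOINT = "https://www.googleapis.com/customsearch/v1";

    // Builds the request URL; apiKey and engineId are values you obtain from Google.
    static String buildUrl(String apiKey, String engineId, String query) {
        return ENDPOINT
                + "?key=" + encode(apiKey)
                + "&cx=" + encode(engineId)
                + "&q=" + encode(query);
    }

    private static String encode(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available on the JVM, so this should not happen.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // YOUR_API_KEY and YOUR_ENGINE_ID are placeholders, not working credentials.
        System.out.println(buildUrl("YOUR_API_KEY", "YOUR_ENGINE_ID", "i might site:genius.com"));
    }
}
```

The resulting URL can be fetched with any HTTP client, and the result titles and links can then be read from the JSON response instead of from scraped HTML.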

I completely agree with Matvey Sidorenko, but to use Google's public search API you need a Google API key. The problem is that Google limits each API key to 100 searches; once you exceed that, it stops working, and the limit resets after 24 hours.

Recently I was working on a project where we needed to get Google search result links for different queries provided by the user. To get around the API limit, I made my own API that searches directly on google/ncr and returns the result links.

Free Google Search API: http://freegoogleapi.azurewebsites.net/ or http://google.bittque.com

I used the HtmlUnit library to build this API.

You can use my API, or you can use the HtmlUnit library yourself to achieve what you need.
