Jsoup on imdb 403 error

I need to parse an IMDb search results page and am using Jsoup to do it. Running the code below fails with a 403 error. I have double-checked the URL, and it appears to be correct.

import java.io.IOException;
import java.net.URLEncoder;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

public class ParseIMDB {

    public static void parse() throws IOException{
        Document doc = Jsoup.connect("http://imdb.com/search/title?count=100&genres=action&languages=en&release_date=2010,2016&title_type=feature").get();
        Elements newsHeadlines = doc.select("#main > table.results tbody");
    }

    public static void main(String[] args) {
        try {
            parse();
        } catch (Exception e) {
            System.out.println("Exception found!");
            e.printStackTrace();
        }
    }
}

I tried encoding the URL with URLEncoder.encode, but that didn't help either.

The stack trace for the code above is:

Exception found!
org.jsoup.HttpStatusException: HTTP error fetching URL. Status=403, URL= http://www.imdb.com/search/title/
    at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:537)
    at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:534)
    at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:493)
    at org.jsoup.helper.HttpConnection.execute(HttpConnection.java:205)
    at org.jsoup.helper.HttpConnection.get(HttpConnection.java:194)
    at ParseIMDB.parse(ParseIMDB.java:13)
    at ParseIMDB.main(ParseIMDB.java:20)

I believe it'll work if you add a User-Agent header to the request. You can do that like this:

Document doc = Jsoup.connect("http://imdb.com/search/title?count=100&genres=action&languages=en&release_date=2010,2016&title_type=feature")
        .userAgent("Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.87 Safari/537.36")
        .get();

This solution has been tested and works, returning a list of movies.
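For completeness, here is a minimal sketch of how the question's class might look with that change applied. The class name ParseIMDBWithUserAgent, the helper methods newConnection and selectResults, and the 10-second timeout are illustrative assumptions, not part of the original code. The selector is copied from the question and may no longer match IMDb's current markup.

```java
import java.io.IOException;

import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

// Hypothetical class name; mirrors the question's ParseIMDB with a User-Agent added.
class ParseIMDBWithUserAgent {

    static final String SEARCH_URL =
            "http://imdb.com/search/title?count=100&genres=action&languages=en&release_date=2010,2016&title_type=feature";

    // Browser-like User-Agent string taken from the answer above.
    static final String USER_AGENT =
            "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.87 Safari/537.36";

    // Builds the request but does not send it yet.
    static Connection newConnection(String url) {
        return Jsoup.connect(url)
                .userAgent(USER_AGENT)
                .timeout(10_000); // assumption: 10 seconds, in milliseconds
    }

    // Applies the selector from the question; it may need updating for the live page.
    static Elements selectResults(Document doc) {
        return doc.select("#main > table.results tbody");
    }

    public static void main(String[] args) {
        try {
            Document doc = newConnection(SEARCH_URL).get(); // sends the GET request
            Elements results = selectResults(doc);
            System.out.println("Matched elements: " + results.size());
        } catch (IOException e) {
            System.out.println("Exception found!");
            e.printStackTrace();
        }
    }
}
```

Keeping the request setup in its own method makes it possible to inspect the configured headers (via `newConnection(url).request()`) without touching the network.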

HTTP 403 means Forbidden. Most likely IMDb is blocking programmatic requests.
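One way to check this guess is to send the same request twice, once with Java's default headers and once with a browser-like User-Agent, and compare the status codes. The sketch below uses only the JDK's HttpURLConnection so that it is independent of Jsoup. The class name ForbiddenProbe and the helpers prepare and describe are hypothetical, and the idea that the block depends on the User-Agent header is an assumption, not something the server documents.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.URI;

// Hypothetical diagnostic class, not part of the original question.
class ForbiddenProbe {

    // Prepares a GET request without sending it; userAgent may be null to keep the default.
    static HttpURLConnection prepare(String url, String userAgent) {
        try {
            HttpURLConnection conn = (HttpURLConnection) URI.create(url).toURL().openConnection();
            conn.setRequestMethod("GET");
            conn.setConnectTimeout(10_000); // assumption: 10-second timeouts
            conn.setReadTimeout(10_000);
            if (userAgent != null) {
                conn.setRequestProperty("User-Agent", userAgent);
            }
            return conn;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Turns a status code into a short human-readable note.
    static String describe(int status) {
        switch (status) {
            case HttpURLConnection.HTTP_OK:
                return "200 OK";
            case HttpURLConnection.HTTP_FORBIDDEN:
                return "403 Forbidden: the server refused this client";
            default:
                return status + " (other status)";
        }
    }

    public static void main(String[] args) throws IOException {
        String url = "http://imdb.com/search/title?count=100&genres=action&languages=en&release_date=2010,2016&title_type=feature";
        String browserUa = "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.87 Safari/537.36";

        for (String ua : new String[] {null, browserUa}) {
            HttpURLConnection conn = prepare(url, ua);
            int status = conn.getResponseCode(); // this call actually sends the request
            System.out.println((ua == null ? "default User-Agent" : "browser User-Agent") + " -> " + describe(status));
            conn.disconnect();
        }
    }
}
```

If the two runs report different statuses, the header is a likely cause. Note that HttpURLConnection does not follow redirects between http and https, so a 3xx status may show up here instead of the final response.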

