
Issues using Jsoup to connect to a webpage

This is my first time using Jsoup, and I am having trouble connecting to a URL that I want to parse information from.

The url: http://uselectionatlas.org/RESULTS/national.php?f=1&year=2008&off=0&elect=0

I originally tried the following, but I got a timeout exception:

    Document doc = Jsoup.connect("http://uselectionatlas.org/RESULTS/national.php?f=1&year=2008&off=0&elect=0").get();

Here is the exception:

java.net.SocketTimeoutException: Read timed out
    at java.net.SocketInputStream.socketRead0(Native Method)
    at java.net.SocketInputStream.read(SocketInputStream.java:152)
    at java.net.SocketInputStream.read(SocketInputStream.java:122)
    at java.io.BufferedInputStream.fill(BufferedInputStream.java:235)
    at java.io.BufferedInputStream.read1(BufferedInputStream.java:275)
    at java.io.BufferedInputStream.read(BufferedInputStream.java:334)
    at sun.net.www.http.HttpClient.parseHTTPHeader(HttpClient.java:687)
    at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:633)
    at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1324)
    at java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:468)
    at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:575)
    at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:548)
    at org.jsoup.helper.HttpConnection.execute(HttpConnection.java:235)
    at org.jsoup.helper.HttpConnection.get(HttpConnection.java:224)
    at ParseData.main(ParseData.java:18)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at com.intellij.rt.execution.application.AppMain.main(AppMain.java:144)

I did some research online and I found a method .timeout(0) which sets the Jsoup timeout to infinite.

Now when I try this

    Document doc = Jsoup.connect("http://uselectionatlas.org/RESULTS/national.php?f=1&year=2008&off=0&elect=0").timeout(0).get();

I get the following exception:

org.jsoup.HttpStatusException: HTTP error fetching URL. Status=403, URL=http://uselectionatlas.org/RESULTS/national.php?f=1&year=2008&off=0&elect=0
    at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:598)
    at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:548)
    at org.jsoup.helper.HttpConnection.execute(HttpConnection.java:235)
    at org.jsoup.helper.HttpConnection.get(HttpConnection.java:224)
    at ParseData.main(ParseData.java:18)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at com.intellij.rt.execution.application.AppMain.main(AppMain.java:144)

Could someone please point me in the right direction on how I should be loading this URL into Jsoup?

A 403 status means the server understood the request but refuses to serve it. The likely cause here is that the request does not carry a browser-like User-Agent header, so set one on the connection as follows:

    Document doc = Jsoup.connect("http://uselectionatlas.org/RESULTS/national.php?f=1&year=2008&off=0&elect=0")
            .userAgent("Mozilla/5.0") // send a browser-like User-Agent header
            .timeout(0)               // 0 means no timeout (wait indefinitely)
            .get();

Some sites do not allow robots, and that is what is happening with this site. You have to send a browser-like user agent so that the request is not rejected.
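
To see the mechanism in isolation, here is a minimal sketch that needs only the JDK (no Jsoup). It assumes the server filters on the User-Agent header, which is a guess about the real site. The class `UserAgentDemo`, its methods, and the local toy server are all hypothetical stand-ins: the server answers 403 unless the User-Agent starts with "Mozilla/", and the client is a plain `HttpURLConnection`, which sends "Java/<version>" as its user agent by default.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.InetSocketAddress;
import java.net.URI;

class UserAgentDemo {

    // Hypothetical stand-in for a site that blocks robots: it answers 403
    // unless the User-Agent header looks like a browser ("Mozilla/...").
    static HttpServer startServer() {
        try {
            // Port 0 lets the OS pick any free port.
            HttpServer server = HttpServer.create(new InetSocketAddress("127.0.0.1", 0), 0);
            server.createContext("/", exchange -> {
                String agent = exchange.getRequestHeaders().getFirst("User-Agent");
                boolean browserLike = agent != null && agent.startsWith("Mozilla/");
                exchange.sendResponseHeaders(browserLike ? 200 : 403, -1); // -1: no response body
                exchange.close();
            });
            server.start();
            return server;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Sends a GET request and returns the HTTP status code.
    // A null userAgent leaves the JDK default ("Java/<version>") in place.
    static int fetchStatus(String url, String userAgent) {
        try {
            HttpURLConnection conn = (HttpURLConnection) URI.create(url).toURL().openConnection();
            conn.setConnectTimeout(5000); // finite timeouts, in milliseconds
            conn.setReadTimeout(5000);
            if (userAgent != null) {
                conn.setRequestProperty("User-Agent", userAgent);
            }
            try {
                return conn.getResponseCode();
            } finally {
                conn.disconnect();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        HttpServer server = startServer();
        try {
            String url = "http://127.0.0.1:" + server.getAddress().getPort() + "/";
            System.out.println("default agent: " + fetchStatus(url, null));
            System.out.println("browser agent: " + fetchStatus(url, "Mozilla/5.0"));
        } finally {
            server.stop(0);
        }
    }
}
```

Against this toy server the first request is rejected with 403 and the second is accepted with 200. Jsoup's `.userAgent("Mozilla/5.0")` sets the same header on the request it sends, which is why adding it can get past this kind of filter.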


 