
Parse https with jsoup (java)

I am trying to parse a document with jsoup (Java). This is my Java code:

    package test;

    import java.io.IOException;
    import java.net.MalformedURLException;
    import java.net.URL;
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;

    public class crawler {
      private static final int TIMEOUT_IN_MS = 5000;

      public static void main(String[] args) throws MalformedURLException, IOException
      {
        Document doc = Jsoup.parse(new URL("http://www.internet.com/"), TIMEOUT_IN_MS);

        System.out.println(doc.html());
      }

    }

OK, this works. But when I try to parse an https site, I get this error message:

    Document doc = Jsoup.parse(new URL("https://www.somesite.com/"), TIMEOUT_IN_MS);

    System.out.println(doc.html());

    Exception in thread "main" org.jsoup.HttpStatusException: HTTP error fetching URL. Status=403, URL=https://www.somesite.com/
        at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:590)
        at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:540)
        at org.jsoup.helper.HttpConnection.execute(HttpConnection.java:227)
        at org.jsoup.helper.HttpConnection.get(HttpConnection.java:216)
        at org.jsoup.Jsoup.parse(Jsoup.java:183)
        at test.crawler.main(crawler.java:14)

I only get this error message when I try to parse https. http works.

Jsoup supports https fine - it just uses Java's URLConnection under the hood.

A 403 server response indicates that the server has 'forbidden' the request, normally due to authorization issues. If you're getting an HTTP response status code at all, the TLS (https) negotiation has already succeeded.
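To make that distinction concrete, here is a minimal sketch that separates a TLS-level failure from an HTTP-level one. The class name FetchDiagnosis and the helper classify are made up for illustration, and the URL is the placeholder from the question; it assumes jsoup is on the classpath.

```java
import java.io.IOException;
import javax.net.ssl.SSLException;
import org.jsoup.HttpStatusException;
import org.jsoup.Jsoup;

public class FetchDiagnosis {

    // Hypothetical helper: reports whether a failed fetch broke at the
    // TLS layer or only after the server had answered with an HTTP status.
    static String classify(IOException e) {
        if (e instanceof SSLException) {
            // Handshake or certificate problem: no HTTP response was received.
            return "TLS failure: " + e.getMessage();
        }
        if (e instanceof HttpStatusException) {
            // The server replied, so TLS worked; the status code is the real clue.
            HttpStatusException h = (HttpStatusException) e;
            return "HTTP status " + h.getStatusCode() + " from " + h.getUrl();
        }
        return "Other I/O failure: " + e.getMessage();
    }

    public static void main(String[] args) {
        // Placeholder URL from the question; pass a real one as the first argument.
        String url = args.length > 0 ? args[0] : "https://www.somesite.com/";
        try {
            System.out.println(Jsoup.connect(url).timeout(5000).get().title());
        } catch (IOException e) {
            System.out.println(classify(e));
        }
    }
}
```

With the exception from the question, this would report an HTTP status of 403, which points at the server's access rules rather than at https.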

The issue here is probably not related to HTTPS; it's just that the URL you're having trouble fetching happens to be HTTPS. You need to understand why the server is giving you a 403. My guess is that either you need to send some authorization tokens (cookies or URL params), or the server is blocking the request because of the user agent (which defaults to "Java" unless you specify it). Lots of services block requests that way. Try setting the user agent to a common browser string, using the Jsoup.connect() methods.
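A minimal sketch of that suggestion, assuming jsoup is on the classpath. The class name BrowserLikeFetch, the helper browserLike, and the particular browser string are illustrative choices, not something the server is known to require; the commented-out cookie is purely hypothetical.

```java
import java.io.IOException;
import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class BrowserLikeFetch {
    private static final int TIMEOUT_IN_MS = 5000;

    // Example desktop browser string (an assumption; any common one may do).
    static final String USER_AGENT =
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        + "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36";

    // Builds a jsoup connection that sends a browser-like User-Agent header.
    static Connection browserLike(String url) {
        return Jsoup.connect(url)
            .userAgent(USER_AGENT)
            .timeout(TIMEOUT_IN_MS);
    }

    public static void main(String[] args) throws IOException {
        // Placeholder URL from the question.
        Document doc = browserLike("https://www.somesite.com/")
            // .cookie("session", "value") // hypothetical: only if the site needs a login cookie
            .get();
        System.out.println(doc.html());
    }
}
```

If the 403 persists with a browser-like user agent, the server is more likely asking for credentials or cookies than filtering on the client string.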

(People won't be able to help you more without real example URLs, because we can't tell what the server is doing just with this info.)

You would need to provide authentication when hitting the URL. Also try the solution in "403 Forbidden with Java but not web browser?" if the request works in a browser but not from Java code.

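If necessary, you can also ignore the SSL certificate:

    Jsoup.connect("https://example.com").validateTLSCertificates(false).get()

Note that validateTLSCertificates was deprecated and then removed in later jsoup releases, and skipping certificate validation does not change a 403 response, since a status code means the TLS handshake already succeeded.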
