
wget and HttpClient work, but Jsoup doesn't, in Java

I am trying to get the HTML source of a webpage from Java code using Jsoup. Below is the code I am using to fetch the page. The server responds with a 500 Internal Server Error.

  String encodedUrl = URIUtil.encodePathQuery(urlToFetch.trim(), "ISO-8859-1");
  Response res = Jsoup.connect(encodedUrl)
        .header("Accept-Language", "en")
        .userAgent(userAgent)
        .data(data)
        .maxBodySize(bodySize)
        .ignoreHttpErrors(true)
        .ignoreContentType(true)
        .timeout(10000)
        .execute();

However, when I fetch the same page with wget from the command line, it works. A simple HttpClient call from code also works.

// Create an instance of HttpClient.
HttpClient client = new HttpClient();

// Create a method instance.
GetMethod method = new GetMethod(url);

// Provide a custom retry handler if necessary
method.getParams().setParameter(HttpMethodParams.RETRY_HANDLER, 
        new DefaultHttpMethodRetryHandler(3, false));

try {
  // Execute the method.
  int statusCode = client.executeMethod(method);

  if (statusCode != HttpStatus.SC_OK) {
    System.err.println("Method failed: " + method.getStatusLine());
  }

  // Read the response body.
  byte[] responseBody = method.getResponseBody();

  // Deal with the response.
  // Use caution: make sure the character encoding is correct and the data is not binary
  System.out.println(new String(responseBody));

} catch (HttpException e) {
  System.err.println("Fatal protocol violation: " + e.getMessage());
  e.printStackTrace();
} catch (IOException e) {
  System.err.println("Fatal transport error: " + e.getMessage());
  e.printStackTrace();
} finally {
  // Release the connection.
  method.releaseConnection();
}  

Is there anything I need to change in the parameters of Jsoup.connect() for it to work?

This does not happen for all URLs, however. It happens specifically for pages from this website:

http://xyo.net/iphone-app/instagram-RrkBUFE/

You need to send an Accept header.

Try this:

    String encodedUrl = "http://xyo.net/iphone-app/instagram-RrkBUFE/";

    Response res = Jsoup.connect(encodedUrl)
            .header("Accept-Language", "en")
            .ignoreHttpErrors(true)
            .ignoreContentType(true)
            .header("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8")
            .followRedirects(true)
            .timeout(10000)
            .method(Connection.Method.GET)
            .execute();


    System.out.println(res.parse());

It works.
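
If you would rather see what a client actually sends than guess, one option is to point it at a throwaway local server that records the request headers. The sketch below is only an illustration under assumptions, not part of the fix itself: it uses nothing but the JDK (`com.sun.net.httpserver.HttpServer` and `HttpURLConnection`, which, as far as I know, is what Jsoup has traditionally used underneath), and the names `AcceptHeaderProbe` and `acceptSeenByServer` are made up for this example. It never contacts xyo.net.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.InetSocketAddress;
import java.net.URI;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical helper: shows which Accept header reaches a server.
public class AcceptHeaderProbe {

    /**
     * Starts a local server on a free port, sends one GET with
     * HttpURLConnection, and returns the Accept header the server received.
     * Pass null to see the JDK default, or a value to override it.
     */
    static String acceptSeenByServer(String acceptOverride) {
        AtomicReference<String> seen = new AtomicReference<>();
        HttpServer server = null;
        try {
            server = HttpServer.create(new InetSocketAddress("127.0.0.1", 0), 0);
            server.createContext("/", exchange -> {
                // Record the header, then answer 204 with no body.
                seen.set(exchange.getRequestHeaders().getFirst("Accept"));
                exchange.sendResponseHeaders(204, -1);
                exchange.close();
            });
            server.start();

            URI uri = URI.create("http://127.0.0.1:" + server.getAddress().getPort() + "/");
            HttpURLConnection conn = (HttpURLConnection) uri.toURL().openConnection();
            if (acceptOverride != null) {
                conn.setRequestProperty("Accept", acceptOverride);
            }
            conn.getResponseCode(); // forces the request to be sent
            conn.disconnect();
            return seen.get();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } finally {
            if (server != null) {
                server.stop(0);
            }
        }
    }

    public static void main(String[] args) {
        System.out.println("default : " + acceptSeenByServer(null));
        System.out.println("explicit: " + acceptSeenByServer(
                "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8"));
    }
}
```

Comparing the two printed lines with what wget or a browser sends is usually enough to spot a header the remote server might be rejecting. The exact default value depends on the JDK version, so no output is shown here.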

Please also note that the site tries to set cookies, so you may need to handle them.
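
Handling cookies mostly means sending back on later requests what the server set on the first one. In Jsoup the calls for that are `Response.cookies()` (a map of the cookies the server set) and `Connection.cookies(Map)` on the next request. The sketch below shows the same round trip with the JDK only, against a local stand-in server, so it runs without Jsoup or network access. The class `CookieReplay`, its methods, and the `session=abc123` cookie are all invented for illustration; I have not checked which cookies xyo.net actually sets.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.InetSocketAddress;
import java.net.URI;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical example: replay a cookie from one response on the next request.
public class CookieReplay {

    /** Sends a GET, optionally with a Cookie header; returns the Set-Cookie value, if any. */
    static String get(URI uri, String cookie) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) uri.toURL().openConnection();
        if (cookie != null) {
            conn.setRequestProperty("Cookie", cookie);
        }
        conn.getResponseCode(); // forces the request to be sent
        String setCookie = conn.getHeaderField("Set-Cookie");
        conn.disconnect();
        return setCookie;
    }

    /** Makes two requests to a local server; returns the Cookie header seen on each. */
    static List<String> cookiesSeenByServer() {
        List<String> seen = Collections.synchronizedList(new ArrayList<>());
        HttpServer server = null;
        try {
            server = HttpServer.create(new InetSocketAddress("127.0.0.1", 0), 0);
            server.createContext("/", exchange -> {
                // Record the incoming Cookie header (null when absent).
                seen.add(exchange.getRequestHeaders().getFirst("Cookie"));
                exchange.getResponseHeaders().add("Set-Cookie", "session=abc123; Path=/");
                exchange.sendResponseHeaders(204, -1);
                exchange.close();
            });
            server.start();
            URI uri = URI.create("http://127.0.0.1:" + server.getAddress().getPort() + "/");

            String setCookie = get(uri, null);             // first request: no cookie yet
            String nameValue = setCookie.split(";", 2)[0]; // keep "name=value", drop attributes
            get(uri, nameValue);                           // second request: replay the cookie
            return seen;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } finally {
            if (server != null) {
                server.stop(0);
            }
        }
    }

    public static void main(String[] args) {
        System.out.println(cookiesSeenByServer());
    }
}
```

This handles a single cookie and ignores attributes such as expiry and domain. For anything more involved, Jsoup's cookie map or a proper cookie store is a better fit than splitting header strings by hand.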

Hope it will help.
