WGET and HttpClient work but Jsoup doesn't work in Java

I am trying to get the HTML source of a web page from Java code using Jsoup. Below is the code I am using to fetch the page. I am getting a 500 Internal Server Error.

  String encodedUrl = URIUtil.encodePathQuery(urlToFetch.trim(), "ISO-8859-1");
  Response res = Jsoup.connect(encodedUrl)
        .header("Accept-Language", "en")
        .userAgent(userAgent)
        .data(data)
        .maxBodySize(bodySize)
        .ignoreHttpErrors(true)
        .ignoreContentType(true)
        .timeout(10000)
        .execute();

However, when I fetch the same page with wget from the command line, it works. A simple HttpClient call from code also works.

// Create an instance of HttpClient.
HttpClient client = new HttpClient();

// Create a method instance.
GetMethod method = new GetMethod(url);

// Provide a custom retry handler if necessary
method.getParams().setParameter(HttpMethodParams.RETRY_HANDLER, 
        new DefaultHttpMethodRetryHandler(3, false));

try {
  // Execute the method.
  int statusCode = client.executeMethod(method);

  if (statusCode != HttpStatus.SC_OK) {
    System.err.println("Method failed: " + method.getStatusLine());
  }

  // Read the response body.
  byte[] responseBody = method.getResponseBody();

  // Deal with the response.
  // Use caution: ensure correct character encoding and is not binary data
  System.out.println(new String(responseBody));

} catch (HttpException e) {
  System.err.println("Fatal protocol violation: " + e.getMessage());
  e.printStackTrace();
} catch (IOException e) {
  System.err.println("Fatal transport error: " + e.getMessage());
  e.printStackTrace();
} finally {
  // Release the connection.
  method.releaseConnection();
}  

Is there anything I need to change in the parameters of the Jsoup.connect() call for it to work?

This does not happen for all URLs, however. It happens specifically for pages from this website:

http://xyo.net/iphone-app/instagram-RrkBUFE/

You need an Accept header.

Try this:

    String encodedUrl = "http://xyo.net/iphone-app/instagram-RrkBUFE/";

    Response res = Jsoup.connect(encodedUrl)
            .header("Accept-Language", "en")
            .ignoreHttpErrors(true)
            .ignoreContentType(true)
            .header("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8")
            .followRedirects(true)
            .timeout(10000)
            .method(Connection.Method.GET)
            .execute();


    System.out.println(res.parse());

It works.
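To illustrate why a missing Accept header can produce a 500 for one client while another succeeds, here is a minimal, self-contained sketch that uses only the JDK (Java 11+). The local server is hypothetical: it is written to fail when no Accept header arrives, as a stand-in for a server that behaves that way. It is an assumption that xyo.net fails for this reason; the class name `AcceptHeaderDemo` and the method `statusFor` are made up for this example.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

public class AcceptHeaderDemo {

    /** Starts a throwaway local server, sends one GET, and returns the HTTP status code. */
    static int statusFor(String acceptHeader) {
        HttpServer server = null;
        try {
            // Hypothetical server: answers 500 whenever the request has no Accept header.
            server = HttpServer.create(new InetSocketAddress("127.0.0.1", 0), 0);
            server.createContext("/", exchange -> {
                boolean hasAccept = exchange.getRequestHeaders().getFirst("Accept") != null;
                byte[] body = (hasAccept ? "<html>ok</html>" : "error")
                        .getBytes(StandardCharsets.UTF_8);
                exchange.sendResponseHeaders(hasAccept ? 200 : 500, body.length);
                try (OutputStream os = exchange.getResponseBody()) {
                    os.write(body);
                }
            });
            server.start();

            URI uri = URI.create("http://127.0.0.1:" + server.getAddress().getPort() + "/");
            HttpRequest.Builder request = HttpRequest.newBuilder(uri).GET();
            if (acceptHeader != null) {
                // Only add the header when the caller supplies one.
                request.header("Accept", acceptHeader);
            }

            HttpClient client = HttpClient.newBuilder()
                    .version(HttpClient.Version.HTTP_1_1)
                    .build();
            return client.send(request.build(), HttpResponse.BodyHandlers.discarding())
                    .statusCode();
        } catch (Exception e) {
            throw new IllegalStateException("Demo request failed", e);
        } finally {
            if (server != null) {
                server.stop(0);
            }
        }
    }

    public static void main(String[] args) {
        System.out.println("without Accept: " + statusFor(null));
        System.out.println("with Accept:    "
                + statusFor("text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8"));
    }
}
```

When two clients get different results for the same URL, comparing the request headers each one sends is usually the quickest way to find the difference.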

Please also note that the site tries to set cookies, so you may need to handle them.
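As a rough sketch of what handling cookies can mean, the example below reads the cookie set by a first response and sends it back on a second request. It uses only the JDK (Java 11+) and a hypothetical local server that returns 403 until it sees its own `session` cookie. The cookie name and value, the status codes, and the names `CookieCarryOverDemo` and `fetchTwice` are all assumptions made for illustration, not observed behaviour of the site in question.

```java
import com.sun.net.httpserver.HttpServer;
import java.net.InetSocketAddress;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class CookieCarryOverDemo {

    /** Returns the status codes of two consecutive GETs; the second carries the cookie. */
    static int[] fetchTwice() {
        HttpServer server = null;
        try {
            // Hypothetical server: sets a cookie on first contact, requires it afterwards.
            server = HttpServer.create(new InetSocketAddress("127.0.0.1", 0), 0);
            server.createContext("/", exchange -> {
                String cookie = exchange.getRequestHeaders().getFirst("Cookie");
                boolean known = cookie != null && cookie.contains("session=abc123");
                if (!known) {
                    exchange.getResponseHeaders().add("Set-Cookie", "session=abc123; Path=/");
                }
                exchange.sendResponseHeaders(known ? 200 : 403, -1); // -1 means no body
                exchange.close();
            });
            server.start();

            URI uri = URI.create("http://127.0.0.1:" + server.getAddress().getPort() + "/");
            HttpClient client = HttpClient.newBuilder()
                    .version(HttpClient.Version.HTTP_1_1)
                    .build();

            // First request: no cookie yet, so the server replies with Set-Cookie.
            HttpResponse<Void> first = client.send(
                    HttpRequest.newBuilder(uri).GET().build(),
                    HttpResponse.BodyHandlers.discarding());

            // Keep only the "name=value" part, dropping attributes such as Path.
            String setCookie = first.headers().firstValue("Set-Cookie").orElse("");
            String cookiePair = setCookie.split(";", 2)[0].trim();

            // Second request: send the cookie back.
            HttpResponse<Void> second = client.send(
                    HttpRequest.newBuilder(uri).GET().header("Cookie", cookiePair).build(),
                    HttpResponse.BodyHandlers.discarding());

            return new int[] { first.statusCode(), second.statusCode() };
        } catch (Exception e) {
            throw new IllegalStateException("Demo request failed", e);
        } finally {
            if (server != null) {
                server.stop(0);
            }
        }
    }

    public static void main(String[] args) {
        int[] statuses = fetchTwice();
        System.out.println("first: " + statuses[0] + ", second: " + statuses[1]);
    }
}
```

With Jsoup, the same carry-over is typically done by passing the map returned by `res.cookies()` to `.cookies(...)` on the next `Jsoup.connect(...)` call.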

Hope this helps.
