
Java URLConnection utf-8 encoding doesn't work

I'm writing a small crawler for English-only sites, and I fetch pages by opening a URL connection. I set the encoding to utf-8 both on the request and on the InputStreamReader, but I continue to get gobbledygook for some of the requests, while others work fine.

The code below reflects all the research I did and the advice I found. I have also tried changing URLConnection to HttpURLConnection, with no luck. Some of the returned strings continue to look like this:

??}?r?H????P?n?c??]?d?G?o??Xj{?x?"P$a?Qt?#&??e?a#?????lfVx)?='b?"Y(defUeefee=??????.??a8??{O??????zY?2?M???3c??@

What am I missing?

My code:

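import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;
import java.nio.charset.StandardCharsets;

public static String getDocumentFromUrl(String urlString) throws Exception {
    // StringBuilder avoids the literal "null" prefix that appending to a null String produces
    StringBuilder wholeDocument = new StringBuilder();

    URL url = new URL(urlString);
    URLConnection conn = url.openConnection();
    conn.setRequestProperty("Content-Type", "text/plain; charset=utf-8");
    conn.setRequestProperty("Accept-Charset", "utf-8");
    conn.setConnectTimeout(60 * 1000); // wait at most 60 seconds to connect
    conn.setReadTimeout(60 * 1000);    // and at most 60 seconds for each read

    // try-with-resources closes the reader and the underlying stream exactly once
    try (BufferedReader in = new BufferedReader(
            new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
        String inputLine;
        while ((inputLine = in.readLine()) != null) {
            wholeDocument.append(inputLine); // note: line breaks are dropped
        }
    }

    return wholeDocument.toString();
}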

The server is sending the document GZIP-compressed. You can set the Accept-Encoding HTTP header to ask it to send the document uncompressed:

conn.setRequestProperty("Accept-Encoding", "identity");

Even so, the HTTP client class handles GZIP compression for you, so you shouldn't have to worry about details like this. What seems to be going on here is that the server is buggy: it does not send the Content-Encoding header to tell you the content is compressed. This behavior seems to depend on the User-Agent, so the site works in regular web browsers but breaks when used from Java. So, setting the user agent also fixes the issue:

conn.setRequestProperty("User-Agent", "Mozilla/5.0"); // for example
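
If neither header is an option, or a response still arrives compressed without a Content-Encoding header, one defensive approach is to peek at the first two bytes of the body and look for the gzip magic number (0x1f 0x8b) before decoding. The following is only a sketch of that idea, not something the question's site is known to need: the class name GzipAwareFetcher and the helpers maybeGunzip and readAll are made up for illustration, and it assumes the decompressed body is UTF-8.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.net.URL;
import java.net.URLConnection;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;

// Hypothetical helper class; the names are illustrative, not part of any library.
class GzipAwareFetcher {

    // Returns a decompressing stream if the data starts with the gzip magic
    // bytes 0x1f 0x8b, otherwise returns the (buffered) stream unchanged.
    static InputStream maybeGunzip(InputStream raw) throws IOException {
        BufferedInputStream in = new BufferedInputStream(raw);
        in.mark(2);            // remember the start so the peeked bytes can be re-read
        int b1 = in.read();
        int b2 = in.read();
        in.reset();
        boolean gzipped = (b1 == 0x1f && b2 == 0x8b);
        return gzipped ? new GZIPInputStream(in) : in;
    }

    // Reads the whole stream as UTF-8 text (assumption: the body is UTF-8).
    static String readAll(InputStream raw) throws IOException {
        StringBuilder sb = new StringBuilder();
        // Declaring `in` first guarantees the raw stream is closed even if
        // setting up the gzip wrapper fails.
        try (InputStream in = raw;
             Reader reader = new InputStreamReader(maybeGunzip(in), StandardCharsets.UTF_8)) {
            char[] buf = new char[4096];
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        }
        return sb.toString();
    }

    // Same timeouts as in the question; no request headers are needed here.
    static String getDocumentFromUrl(String urlString) throws IOException {
        URLConnection conn = new URL(urlString).openConnection();
        conn.setConnectTimeout(60 * 1000);
        conn.setReadTimeout(60 * 1000);
        return readAll(conn.getInputStream());
    }
}
```

Unlike the readLine loop in the question, this reads raw characters, so line breaks in the document are preserved. Sniffing the body is a workaround for a misbehaving server; with a well-behaved one, checking conn.getContentEncoding() for "gzip" would be the more conventional test.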
