Java URLConnection utf-8 encoding doesn't work
I'm writing a small crawler for English-only sites, and I'm doing that by opening a URL connection. I set the encoding to utf-8 both on the request and on the InputStreamReader, but I continue to get gobbledygook for some of the requests, while others work fine.
The following code reflects all the research I did and the advice I found. I have also tried changing URLConnection to HttpURLConnection, with no luck. Some of the returned strings continue to look like this:
??}?r?H????P?n?c??]?d?G?o??Xj{?x?"P$a?Qt?#&??e?a#?????lfVx)?='b?"Y(defUeefee=??????.??a8??{O??????zY?2?M???3c??@
What am I missing?
My code:
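import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;
import java.nio.charset.StandardCharsets;

public static String getDocumentFromUrl(String urlString) throws Exception {
    // StringBuilder avoids the literal "null" prefix that "null += line" produces
    StringBuilder wholeDocument = new StringBuilder();
    URL url = new URL(urlString);
    URLConnection conn = url.openConnection();
    conn.setRequestProperty("Content-Type", "text/plain; charset=utf-8");
    conn.setRequestProperty("Accept-Charset", "utf-8");
    conn.setConnectTimeout(60 * 1000); // wait at most 60 seconds to connect
    conn.setReadTimeout(60 * 1000);    // wait at most 60 seconds for each read
    // try-with-resources closes the reader and the underlying stream
    try (BufferedReader in = new BufferedReader(
            new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
        String inputLine;
        while ((inputLine = in.readLine()) != null) {
            wholeDocument.append(inputLine); // readLine() strips the line terminators
        }
    }
    return wholeDocument.toString();
}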
The server is sending the document GZIP-compressed. You can set the Accept-Encoding HTTP header to make it send the document uncompressed:
conn.setRequestProperty("Accept-Encoding", "identity");
Even so, you shouldn't normally have to worry about details like this, since compression is supposed to be negotiated through the HTTP headers. What seems to be going on here is that the server is buggy: it does not send the Content-Encoding header to tell you the content is compressed. This behavior seems to depend on the User-Agent, so the site works in regular web browsers but breaks when accessed from Java. So setting the user agent also fixes the issue:
conn.setRequestProperty("User-Agent", "Mozilla/5.0"); // for example
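If you would rather not depend on what the server claims in its headers, one defensive option is to look at the first two bytes of the response yourself: a GZIP stream always starts with the magic bytes 0x1f 0x8b. The following is only a minimal sketch of that idea, not something the server or the JDK requires. The class name GzipSniffer and the methods readBody and gzip are made up for this example, and it assumes the body is UTF-8 once decompressed.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper for illustration; the names are not from the original post.
class GzipSniffer {

    /** Reads the whole stream as UTF-8 text, gunzipping it first if it starts with the GZIP magic bytes. */
    static String readBody(InputStream raw) throws IOException {
        BufferedInputStream in = new BufferedInputStream(raw);
        in.mark(2);               // remember the start so the peek can be undone
        int b1 = in.read();
        int b2 = in.read();
        in.reset();               // rewind to the first byte
        boolean gzipped = (b1 == 0x1f && b2 == 0x8b);

        // Closing 'body' also closes the wrapped streams.
        try (InputStream body = gzipped ? new GZIPInputStream(in) : in) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] chunk = new byte[8192];
            int n;
            while ((n = body.read(chunk)) != -1) {
                out.write(chunk, 0, n);
            }
            return new String(out.toByteArray(), StandardCharsets.UTF_8);
        }
    }

    /** Compresses bytes with GZIP; only used here to build test input without a network call. */
    static byte[] gzip(byte[] data) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(buf)) {
            gz.write(data);
        }
        return buf.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] utf8 = "plain or gzipped, same result".getBytes(StandardCharsets.UTF_8);
        // Both calls print the same text.
        System.out.println(readBody(new ByteArrayInputStream(utf8)));
        System.out.println(readBody(new ByteArrayInputStream(gzip(utf8))));
    }
}
```

In getDocumentFromUrl you would then pass conn.getInputStream() to readBody instead of wrapping it in an InputStreamReader directly. Unlike the reader loop in the question, this keeps the line breaks of the document.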