HttpURLConnection with HTTPS: InputStream is garbled
I use HttpURLConnection to crawl https://translate.google.com/.
InetSocketAddress addr = new InetSocketAddress("127.0.0.1", 1082);
Proxy proxy = new Proxy(Proxy.Type.HTTP, addr);
URL url = new URL("https://translate.google.com/");
HttpURLConnection conn = (HttpURLConnection) url.openConnection(proxy);
conn.setRequestProperty("Accept-Encoding", "gzip, deflate, sdch");
conn.setRequestProperty("Connection", "keep-alive");
conn.setRequestProperty("User-Agent",
"Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.76 Mobile Safari/537.36");
conn.setRequestProperty("Accept", "*/*");
Map<String, List<String>> reqHeaders = conn.getHeaderFields();
List<String> reqTypes = reqHeaders.get("Content-Type");
for (String ss : reqTypes) {
System.out.println(ss);
}
InputStream in = conn.getInputStream();
String s = IOUtils.toString(in, "UTF-8");
System.out.println(s.substring(0, 100));
Map<String, List<String>> resHeader = conn.getHeaderFields();
List<String> resTypes = resHeader.get("Content-Type");
for (String ss : resTypes) {
System.out.println(ss);
}
The console output is garbled.
But when I change the URL to http://translate.google.com/, it works well.
I know that the HttpURLConnection is actually an HttpsURLConnection when I crawl https://translate.google.com/. I tried using HttpsURLConnection directly and the output is still garbled.
Any suggestions?
conn.setRequestProperty("Accept-Encoding", "gzip, deflate, sdch");
The response is compressed, because the line above tells the server that the client can understand the encodings listed in Accept-Encoding.
Try commenting out this line, or handle the compressed response yourself.
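As a rough illustration of "handling" the compressed response, the sketch below picks a decompressing stream based on the Content-Encoding response header. The class name ContentDecoding and its methods wrap and readAll are hypothetical helpers made up for this example, not part of the original code, and the deflate branch assumes the zlib-wrapped form.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;
import java.util.Locale;
import java.util.zip.GZIPInputStream;
import java.util.zip.InflaterInputStream;

// Hypothetical helper for illustration only.
final class ContentDecoding {

    // Wraps the raw body stream according to the Content-Encoding response header.
    static InputStream wrap(InputStream raw, String contentEncoding) throws IOException {
        if (contentEncoding == null) {
            return raw; // no Content-Encoding header: body is not compressed
        }
        switch (contentEncoding.trim().toLowerCase(Locale.ROOT)) {
            case "gzip":
            case "x-gzip":
                return new GZIPInputStream(raw);
            case "deflate":
                // Assumption: zlib-wrapped deflate; some servers send raw deflate instead.
                return new InflaterInputStream(raw);
            default:
                return raw; // "identity" or an encoding this sketch does not handle
        }
    }

    // Reads the whole stream into a String using the given charset.
    static String readAll(InputStream in, Charset charset) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[8192];
        int n;
        while ((n = in.read(buf)) != -1) {
            out.write(buf, 0, n);
        }
        return new String(out.toByteArray(), charset);
    }
}
```

With the connection from the question, this could be used as ContentDecoding.wrap(conn.getInputStream(), conn.getContentEncoding()), where getContentEncoding() returns the value of the Content-Encoding response header or null if it is absent.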
There is also a more specific implementation for HTTPS, HttpsURLConnection, in case you're interested in HTTPS-specific features, e.g.:
import javax.net.ssl.HttpsURLConnection;
....
URL url = new URL("https://www.google.com/");
HttpsURLConnection conn = (HttpsURLConnection) url.openConnection();
I accepted Jerry Chin's answer; it solved my problem. This answer just records how I resolved it. If this approach is unreasonable, let me know and I'll remove this answer.
conn.setRequestProperty("Accept-Encoding", "gzip, deflate, sdch");
Then I checked the response's Content-Encoding header: it is gzip.
So I use GZIPInputStream to read the response.
InputStream in = conn.getInputStream();
GZIPInputStream gzis=new GZIPInputStream(in);
InputStreamReader reader = new InputStreamReader(gzis);
BufferedReader br = new BufferedReader(reader);
Now the InputStream reads normally.
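One more thing that can still garble text after decompression: the InputStreamReader above uses the platform default charset. A minimal sketch, assuming the server declares its charset in the Content-Type header (for example "text/html; charset=UTF-8") and that falling back to UTF-8 is acceptable otherwise. The class ResponseCharset and its method fromContentType are hypothetical names for this example.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

// Hypothetical helper for illustration only.
final class ResponseCharset {

    // Extracts the charset parameter from a Content-Type value, falling back to UTF-8.
    static Charset fromContentType(String contentType) {
        if (contentType != null) {
            for (String part : contentType.split(";")) {
                String p = part.trim();
                if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                    String name = p.substring("charset=".length()).trim().replace("\"", "");
                    try {
                        return Charset.forName(name);
                    } catch (IllegalArgumentException e) {
                        break; // unknown or malformed charset name: use the fallback
                    }
                }
            }
        }
        return StandardCharsets.UTF_8; // assumption: UTF-8 is a sensible default
    }
}
```

The reader could then be created as new InputStreamReader(gzis, ResponseCharset.fromContentType(conn.getContentType())).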
BTW, if you don't need Accept-Encoding, you can remove it.
And don't forget to check the User-Agent. It matters, because different operating systems correspond to different User-Agent strings.