
HttpURLConnection with HTTPS: InputStream is garbled

I use HttpURLConnection to crawl https://translate.google.com/.

        InetSocketAddress addr = new InetSocketAddress("127.0.0.1", 1082);
        Proxy proxy = new Proxy(Proxy.Type.HTTP, addr);
        url = new URL("https://translate.google.com/");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection(proxy);
        conn.setRequestProperty("Accept-Encoding", "gzip, deflate, sdch");
        conn.setRequestProperty("Connection", "keep-alive");
        conn.setRequestProperty("User-Agent",
                "Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.76 Mobile Safari/537.36");
        conn.setRequestProperty("Accept", "*/*");

        Map<String, List<String>> reqHeaders = conn.getHeaderFields();
        List<String> reqTypes = reqHeaders.get("Content-Type");
        for (String ss : reqTypes) {
            System.out.println(ss);
        }

        InputStream in = conn.getInputStream();
        String s = IOUtils.toString(in, "UTF-8");
        System.out.println(s.substring(0, 100));

        Map<String, List<String>> resHeader = conn.getHeaderFields();
        List<String> resTypes = resHeader.get("Content-Type");
        for (String ss : resTypes) {
            System.out.println(ss);
        }

The console output is:

[screenshot of the garbled console output]

But when I change the URL to http://translate.google.com/, it works well.

I know that the HttpURLConnection is actually an HttpsURLConnection when I crawl https://translate.google.com/. I tried using HttpsURLConnection directly, and the output is still garbled.

Any suggestions?

conn.setRequestProperty("Accept-Encoding", "gzip, deflate, sdch");

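The response is compressed because the line above tells the server that the client can handle the encodings listed in Accept-Encoding. Reading the compressed bytes as UTF-8 text is what produces the garbled output.

Try commenting out this line, or decompress the response yourself before decoding it.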
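To illustrate why the output looks garbled, here is a minimal in-memory sketch. It does not touch the network: the class name GzipGarbleDemo and the helpers gzip and gunzip are made up for this example, and the HTML string only stands in for a compressed response body.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class GzipGarbleDemo {

    // Hypothetical helper: compress text the way a server would
    // for a response sent with "Content-Encoding: gzip".
    static byte[] gzip(String text) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bytes)) {
            gz.write(text.getBytes(StandardCharsets.UTF_8));
        }
        return bytes.toByteArray();
    }

    // Hypothetical helper: decompress gzip bytes, then decode them as UTF-8.
    static String gunzip(byte[] compressed) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPInputStream gz =
                 new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            byte[] buf = new byte[4096];
            int n;
            while ((n = gz.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for a compressed response body.
        byte[] body = gzip("<!DOCTYPE html><html><body>hello</body></html>");

        // Decoding the compressed bytes directly gives unreadable text.
        System.out.println("raw as UTF-8: " + new String(body, StandardCharsets.UTF_8));

        // Decompressing first recovers the original markup.
        System.out.println("decompressed: " + gunzip(body));
    }
}
```

The first line printed is unreadable because the bytes are still gzip data, which always starts with the two magic bytes 0x1f 0x8b. The second line is the original markup.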

There's also a more specific implementation for HTTPS, i.e. HttpsURLConnection, in case you're interested in HTTPS-specific features, e.g.:

import javax.net.ssl.HttpsURLConnection;

....

URL url = new URL("https://www.google.com/");
HttpsURLConnection conn = (HttpsURLConnection) url.openConnection();

I accepted Jerry Chin's answer; it solved my problem. This answer just records how I resolved it. If this approach is unreasonable, let me know and I'll remove this answer.

conn.setRequestProperty("Accept-Encoding", "gzip, deflate, sdch");

I then checked the Content-Encoding header of the response. It was gzip.

So I wrap the stream in a GZIPInputStream to read it.

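InputStream in = conn.getInputStream();
// Decompress the gzip-encoded body before decoding it as text.
GZIPInputStream gzis = new GZIPInputStream(in);
// Name the charset explicitly; otherwise the platform default is used.
InputStreamReader reader = new InputStreamReader(gzis, "UTF-8");
BufferedReader br = new BufferedReader(reader);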

Now the InputStream reads normally.
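Since the server may answer with gzip, deflate, or no compression at all, it can be safer to pick the wrapper from the Content-Encoding value instead of always assuming gzip. The following is only a sketch: EncodedBodyReader, decode and readUtf8 are hypothetical names, "deflate" is assumed to be zlib-wrapped (some servers send raw deflate instead), and sdch is not handled.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import java.util.Locale;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;
import java.util.zip.InflaterInputStream;

public class EncodedBodyReader {

    // Hypothetical helper: wrap the raw body stream according to the
    // value of the Content-Encoding response header (may be null).
    static InputStream decode(InputStream raw, String contentEncoding) throws IOException {
        if (contentEncoding == null) {
            return raw; // no header: body is not compressed
        }
        String enc = contentEncoding.trim().toLowerCase(Locale.ROOT);
        switch (enc) {
            case "":
            case "identity":
                return raw;
            case "gzip":
            case "x-gzip":
                return new GZIPInputStream(raw);
            case "deflate":
                // Assumption: zlib-wrapped deflate data.
                return new InflaterInputStream(raw);
            default:
                throw new IOException("Unsupported Content-Encoding: " + contentEncoding);
        }
    }

    // Hypothetical helper: read the whole stream as UTF-8 text.
    static String readUtf8(InputStream in) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (Reader r = new BufferedReader(
                 new InputStreamReader(in, StandardCharsets.UTF_8))) {
            char[] buf = new char[4096];
            int n;
            while ((n = r.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // In-memory stand-in for a gzip-compressed response body.
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bytes)) {
            gz.write("<html>hello</html>".getBytes(StandardCharsets.UTF_8));
        }
        InputStream raw = new ByteArrayInputStream(bytes.toByteArray());
        System.out.println(readUtf8(decode(raw, "gzip")));
    }
}
```

With a real connection, the same helper could be called as decode(conn.getInputStream(), conn.getContentEncoding()), since URLConnection exposes the header through getContentEncoding().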

BTW, if you don't need Accept-Encoding, you can simply remove it.

Also, do not forget to check the User-Agent. It is very important, because different operating systems correspond to different User-Agent strings.
