Using Java BufferedReader to get HTML from a URL

I'm trying to read all the HTML from a page using a BufferedReader, as follows:

String charset = "UTF-8";
URLConnection connection = new URL(url).openConnection();
connection.addRequestProperty("User-Agent",
        "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)");
connection.setRequestProperty("Accept-Charset", charset);
InputStream response = connection.getInputStream();
BufferedReader br = new BufferedReader(new InputStreamReader(response, charset));

Then I read it line by line like this:

String data = br.readLine();
while (data != null) {
    // process the current line here
    data = br.readLine();
}

The problem is that I'm getting something like this:

}$B!)(BL$B!)(Bu"~$B!)$(D"C(B|X$B!x!)!x(B}

I've tried re-encoding each line with every available charset:

do {
    data = br.readLine();
    SortedMap<String, Charset> map = Charset.availableCharsets();
    for (Map.Entry<String, Charset> entry : map.entrySet()) {
        System.out.println(entry.getKey());
        try {
            System.out.println(new String(data.getBytes(entry.getValue())));
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
} while (data != null);

and I'm not getting readable HTML with any of them. This is really weird, since it was working fine until this morning and I didn't change anything. What am I doing wrong here? Is it possible that something changed on the website I'm trying to read? Please help.

The server has switched to sending compressed data, which you can see in its response headers:

Connection:keep-alive
Content-Encoding:gzip
Content-Type:text/html; charset=utf-8
Date:Mon, 09 Mar 2015 09:34:41 GMT
Server:nginx
Transfer-Encoding:chunked
Vary:Accept-Encoding
X-Powered-By:PHP/5.5.16-pl0-gentoo

As you can see, the content encoding is set to gzip (Content-Encoding:gzip). The bytes you are reading are therefore compressed, which is why no charset makes them readable. You have to decompress the gzipped content first:

GZIPInputStream gzis = new GZIPInputStream(connection.getInputStream());
BufferedReader br = new BufferedReader(new InputStreamReader(gzis, charset));
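
Since the server changed its behaviour once, it may be safer to decompress only when the response actually says it is gzipped. The following is a minimal sketch of that idea, not a definitive implementation. The class name GzipAwareReader and the helpers decode, readAll and gzip are hypothetical names made up for illustration, and the demo compresses a string in memory instead of contacting a real server.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper class, for illustration only.
public class GzipAwareReader {

    // Wrap the raw stream in a GZIPInputStream only when the
    // Content-Encoding value is "gzip"; otherwise return it unchanged.
    public static InputStream decode(InputStream raw, String contentEncoding) throws IOException {
        if ("gzip".equalsIgnoreCase(contentEncoding)) {
            return new GZIPInputStream(raw);
        }
        return raw;
    }

    // Read the whole (possibly compressed) body as UTF-8 text, line by line.
    public static String readAll(InputStream raw, String contentEncoding) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (BufferedReader br = new BufferedReader(
                new InputStreamReader(decode(raw, contentEncoding), StandardCharsets.UTF_8))) {
            String line;
            while ((line = br.readLine()) != null) {
                sb.append(line).append('\n');
            }
        }
        return sb.toString();
    }

    // Demo-only helper: gzip a string in memory to stand in for a server response.
    public static byte[] gzip(String text) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
            gz.write(text.getBytes(StandardCharsets.UTF_8));
        }
        return bos.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] compressed = gzip("<html>\n<body>hello</body>\n</html>");
        System.out.print(readAll(new ByteArrayInputStream(compressed), "gzip"));
    }
}
```

With a real connection, the call would presumably look like readAll(connection.getInputStream(), connection.getContentEncoding()), where getContentEncoding() returns the value of the Content-Encoding response header or null if it is absent.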

To view the headers of requests and responses, you can use a network monitor (see Free Network Monitor).

It is simpler to use the developer tools built into most common browsers. Here is the Chrome DevTools documentation on how to use the Network tab: https://developer.chrome.com/devtools/docs/network
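
The response headers can also be printed from the Java program itself, which shows what the server sends to this particular client. Below is a rough sketch under stated assumptions: HeaderDump and format are hypothetical names, and the demo uses a hand-built map rather than a live connection. It assumes the map has the shape returned by URLConnection.getHeaderFields(), where the common JDK implementation stores the status line under a null key.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper class, for illustration only.
public class HeaderDump {

    // Turn a header map into "Name: value" lines.
    // An entry with a null key (typically the status line) is emitted as the bare value.
    public static List<String> format(Map<String, List<String>> headers) {
        List<String> lines = new ArrayList<>();
        for (Map.Entry<String, List<String>> e : headers.entrySet()) {
            String value = String.join(", ", e.getValue());
            lines.add(e.getKey() == null ? value : e.getKey() + ": " + value);
        }
        return lines;
    }

    public static void main(String[] args) {
        // Sample data modelled on the headers shown in the answer.
        Map<String, List<String>> sample = new LinkedHashMap<>();
        sample.put(null, Arrays.asList("HTTP/1.1 200 OK"));
        sample.put("Content-Encoding", Arrays.asList("gzip"));
        sample.put("Content-Type", Arrays.asList("text/html; charset=utf-8"));
        for (String line : format(sample)) {
            System.out.println(line);
        }
    }
}
```

Against a real connection, passing connection.getHeaderFields() to format should list Content-Encoding alongside the other headers, which makes a change like the one described here easy to spot.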
