Trying to download an HTML page via URL in Java, but getting weird symbols instead

So I am trying to download this page http://www.csfd.cz/film/895-28-dni-pote/prehled/ . I am using this code:

    URL url = new URL("http://www.csfd.cz/film/895-28-dni-pote/prehled/");
    // Read the response line by line, decoding the bytes as UTF-8
    try (BufferedReader br = new BufferedReader(
            new InputStreamReader(url.openStream(), Charset.forName("UTF-8")))) {
        String line = br.readLine();
        while (line != null) {
            System.out.println(line);
            line = br.readLine();
        }
    }

It worked on some other pages, but for this one it prints weird symbols. For example, the second line I get is: " \\ ? c n ". (This is not copied exactly as it appears in the Eclipse console.)

I believe I am using UTF-8, which is also the page's encoding. In case you are wondering, the page is in Czech. Thanks for any help.

$ curl -D- http://www.csfd.cz/film/895-28-dni-pote/prehled/
HTTP/1.1 200 OK
Server: nginx
Date: Mon, 01 Feb 2016 08:11:36 GMT
Content-Type: text/html; charset=utf-8
Transfer-Encoding: chunked
Connection: close
X-Frame-Options: SAMEORIGIN
X-Powered-By: Nette Framework
Vary: X-Requested-With
X-From-Cache: TRUE
Content-Encoding: gzip

▒}I▒▒▒▒^▒▒29B▒▒▒$R▒M▒$nER▒▒4X, @
etc....

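Notice the Content-Encoding: gzip header: the response body is compressed with gzip, and you need to decompress it before you can read it as text. The weird symbols are the compressed bytes being decoded as if they were UTF-8.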

Study the classes in java.util.zip, especially GZIPInputStream, which you can wrap around a regular input stream such as the one returned by url.openStream().
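As a rough sketch of that suggestion (not tested against the actual site), the raw stream could be wrapped in a GZIPInputStream whenever the response declares gzip in its Content-Encoding header. The class name GzipPageDownload and the helpers decodeBody, readAll and download are made up for illustration; the java.net and java.util.zip calls are standard JDK API.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;

public class GzipPageDownload {

    // Hypothetical helper: wrap the raw stream in a GZIPInputStream only when
    // the server declared "Content-Encoding: gzip"; otherwise pass it through.
    static InputStream decodeBody(InputStream raw, String contentEncoding) throws IOException {
        if (contentEncoding != null && contentEncoding.equalsIgnoreCase("gzip")) {
            return new GZIPInputStream(raw);
        }
        return raw;
    }

    // Hypothetical helper: read the (already decompressed) stream as UTF-8 text,
    // normalizing every line ending to '\n'.
    static String readAll(InputStream in) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (BufferedReader br = new BufferedReader(
                new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = br.readLine()) != null) {
                sb.append(line).append('\n');
            }
        }
        return sb.toString();
    }

    // Hypothetical helper: open the connection, check the Content-Encoding
    // header, and decode the body accordingly.
    static String download(String address) throws IOException {
        URLConnection conn = new URL(address).openConnection();
        InputStream raw = conn.getInputStream();
        return readAll(decodeBody(raw, conn.getContentEncoding()));
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.err.println("Usage: java GzipPageDownload <url>");
            return;
        }
        System.out.print(download(args[0]));
    }
}
```

Running it with the URL from the question as the only argument should print readable HTML, assuming the server still answers with gzip as in the curl output above. Checking the header instead of always wrapping the stream keeps the code working for the other pages that were not compressed.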
