
How to read a CSV file from a URL with UTF-8 characters in Java

How do I correctly read a .csv file in Java? I have a UTF-8 encoded file but cannot read certain characters correctly.


My code:

String link = new String("https://stat.gov.pl/download/gfx/portalinformacyjny/pl/defaultstronaopisowa/4741/1/1/miesieczne_wskazniki_cen_towarow_i_uslug_konsumpcyjnych_od_1982_roku.csv");

URL url = new URL(link);

BufferedReader read = new BufferedReader(
        new InputStreamReader(url.openStream(), StandardCharsets.UTF_8));
String i;
while ((i = read.readLine()) != null)
    System.out.println(i);
read.close();

https://stat.gov.pl/download/gfx/portalinformacyjny/pl/defaultstronaopisowa/4741/1/1/miesieczne_wskazniki_cen_towarow_i_uslug_konsumpcyjnych_od_1982_roku.csv

That is not UTF-8!

That is why your code fails: you assumed the file was UTF-8, and it isn't. The response headers describe it as 'binary' (it really isn't, but the point is that the server isn't giving you a charset either), so you have to guess. It's probably Windows-1250.
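If you want to check for yourself whether a server declares a charset, here is a minimal sketch. The class name HeaderCharset and the helper charsetOf are made up for illustration; the only JDK calls used are URLConnection.getContentType() and Charset.forName(). It assumes the charset, if present, appears as a "charset=..." parameter of the Content-Type header.

```java
import java.net.URL;
import java.net.URLConnection;
import java.nio.charset.Charset;
import java.util.Optional;

public class HeaderCharset {

    // Hypothetical helper: pull a charset out of a Content-Type value such as
    // "text/csv; charset=UTF-8". Returns empty when none is declared or when
    // the declared name is not one this JVM knows.
    static Optional<Charset> charsetOf(String contentType) {
        if (contentType == null) {
            return Optional.empty();
        }
        for (String part : contentType.split(";")) {
            String p = part.trim();
            // Case-insensitive check for the "charset=" prefix (8 characters).
            if (p.regionMatches(true, 0, "charset=", 0, 8)) {
                String name = p.substring(8).trim().replace("\"", "");
                try {
                    return Optional.of(Charset.forName(name));
                } catch (IllegalArgumentException e) {
                    // Illegal or unsupported charset name.
                    return Optional.empty();
                }
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) throws Exception {
        // Pass the CSV URL as the first argument.
        URLConnection conn = new URL(args[0]).openConnection();
        String contentType = conn.getContentType();
        System.out.println("Content-Type: " + contentType);
        System.out.println("Declared charset: " + charsetOf(contentType));
    }
}
```

If this prints Optional.empty for the declared charset, the server has told you nothing and you are back to guessing.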

This byte sequence is in that CSV:

57 61 72 74 6F 9C E6

The last two bytes are the 'interesting' ones (the others are in the ASCII block, so they are identical in just about every encoding). So that reads Warto??, where the ?s are the interesting parts. If this is Windows-1250, it spells Wartość. Google tells me that's Polish.
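You can run that check in Java instead of by hand. A small sketch (the class name DecodeProbe and the list of candidate charsets are my own picks, not anything the file tells you): decode the same seven bytes under a few encodings and see which one produces a real word.

```java
import java.nio.charset.Charset;

public class DecodeProbe {

    // Decode the same raw bytes under the named charset.
    static String decode(byte[] bytes, String charsetName) {
        return new String(bytes, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        // The byte sequence quoted above: 57 61 72 74 6F 9C E6
        byte[] sample = {0x57, 0x61, 0x72, 0x74, 0x6F, (byte) 0x9C, (byte) 0xE6};

        // Candidate encodings to try; extend the list as needed.
        String[] candidates = {"UTF-8", "ISO-8859-2", "windows-1250"};
        for (String name : candidates) {
            System.out.println(name + " -> " + decode(sample, name));
        }
    }
}
```

Under UTF-8 the last two bytes are malformed and come out as replacement characters; under windows-1250 they decode to ś and ć. Whether your console shows them properly depends on the console's own encoding, which is a separate problem.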

So, you need to do three things to fix this:

  1. Stop assuming everything is UTF_8.
  2. Learn mojibake detective skills. This involves downloading stuff raw, using hex editors, hunting for things that seem like half of a familiar term (like Warto..), and then looking up likely code pages and checking whether the bytes match what you thought they should be. It's a drag, and there are no shortcuts: when the server doesn't tell you what the encoding is, mojibake detective skills are your only option.
  3. Replace StandardCharsets.UTF_8 with "Windows-1250", which I'm pretty sure works on any JVM. If not, oof. You'd have to write that codepage yourself and register it as a charset.
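Putting step 3 into the code from the question, a sketch could look like this. The class name CsvReader1250 and the helper readLines are mine; the rest is the asker's code with the charset swapped and try-with-resources added. It still rests on the guess that the file is Windows-1250. Charset.forName throws UnsupportedCharsetException if the JVM doesn't have that charset, so you find out right away.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.URL;
import java.nio.charset.Charset;
import java.util.ArrayList;
import java.util.List;

public class CsvReader1250 {

    // Assumption: the CSV is Windows-1250, as guessed from the bytes above.
    static final Charset CSV_CHARSET = Charset.forName("windows-1250");

    // Read every line from the stream, decoding with CSV_CHARSET.
    static List<String> readLines(InputStream in) throws IOException {
        List<String> lines = new ArrayList<>();
        try (BufferedReader reader =
                 new BufferedReader(new InputStreamReader(in, CSV_CHARSET))) {
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        URL url = new URL("https://stat.gov.pl/download/gfx/portalinformacyjny/pl/defaultstronaopisowa/4741/1/1/miesieczne_wskazniki_cen_towarow_i_uslug_konsumpcyjnych_od_1982_roku.csv");
        try (InputStream in = url.openStream()) {
            for (String line : readLines(in)) {
                System.out.println(line);
            }
        }
    }
}
```

Keeping the decoding in a method that takes an InputStream means you can feed it a saved copy of the file just as easily as the live URL.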
