
Android / Jsoup: how to fix encoding issues

I'm developing an app that fetches legislation online and automatically parses and formats it to fit the app. The test site I'm using is

http://www.planalto.gov.br/ccivil_03/constituicao/constituicao.htm

I want to grab all the contents of that URL, parse (and maybe clean) them, and put them in a file. I'm using Jsoup; this is the Runnable I use to connect and write the content to a file:

class FetchHtmlRunnable implements Runnable {
    String url;

    FetchHtmlRunnable(String url) {
        this.url = url;
    }

    @Override
    public void run() {
        try {
            Document doc = Jsoup.parse(new URL(url), 10000);
            doc.charset(Charset.forName("windows-1252"));
            Charset charset = doc.charset();

            String htmlString = Jsoup.clean(doc.toString(), new Whitelist());

            Log.d(TAG, "run: HTMLSTRING: " + htmlString);

            String root = context.getFilesDir().toString();
            file = new File(root + File.separator + "law.txt");

            OutputStreamWriter out = new OutputStreamWriter(new FileOutputStream(file, false), charset);
            out.write(htmlString);
        } catch (IOException ex) {
            ex.printStackTrace();
        }
    }
}

However, even though Chrome tells me the site's encoding is windows-1252, both the log entry and the file are filled with replacement characters (every character with a diacritic, such as í and ã, is lost), and all the newlines are lost as well:

Constitui o Presid ncia da Rep blica Casa Civil Subchefia para Assuntos Jur dicos CONSTITUI O DA REP BLICA FEDERATIVA DO BRASIL DE 1988 Vide Emenda Constitucional n 91, de 2016 Vide Emenda Constitucional n 106, de 2020 Vide Emenda Constitucional n 107, de 2020 Emendas Constitucionais Emendas Constitucionais de Revis o Ato das Disposi es Constitucionais Transit rias Atos decorrentes do disposto no 3 do art. 5 NDICE TEM TICO Texto compilado PRE MBULO N s, representantes do povo brasileiro, reunidos em Assembl ia Nacional Constituinte para instituir um Estado Democr tico, destinado a assegurar o exerc cio dos direitos sociais e individuais, a liberdade, a seguran a, o bem-estar, o desenvolvimento, a igualdade ea justi a como valores supremos de uma sociedade fraterna, pluralista e sem preconceitos, fundada na harmonia social e comprometida

Maybe someone better at web dev can tell me whether that's a problem with the webpage itself and how I can work around it... and how I can keep the newline characters.

I will write the remainder of this answer about character sets in Portuguese, Spanish (and Chinese) in just a second. First, though, let me say that the page you are trying to read actually loads its contents using AJAX / JS. I can download AJAX content using my own library, which is available on the Internet; otherwise a tool such as Selenium, Puppeteer, or Splash would be necessary. Leaving character sets aside, how are you downloading the contents of your "Brazilian Constitution" as HTML in the first place? When I try a straight HTML downloader (no script execution), I get a pile of JavaScript without any Portuguese at all, and it looks nothing like the output posted in your question. :)

If you are already downloading the HTML and only have a problem with the character set, read the rest of this answer. If you have been unable to download anything but the AJAX / JavaScript calls, I can post another answer that explains executing JS / AJAX in one or two lines. (Essentially, what you posted isn't the same output that I'm getting.)


In 99.9999% of cases, if text is not plain ASCII (because it contains non-English characters), it is (almost) guaranteed to be readable as UTF-8. I translate Spanish and Chinese news articles, and UTF-8 always works for me. I had one Spanish site that expected an encoding called ISO-8859-1, but other than that "Don Quijote de La Mancha" site, UTF-8 works.
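For the less common case where a page is not UTF-8 (the question reports windows-1252), here is a minimal sketch of what a charset mismatch does to the text. The class and method names are made up for illustration, and the sample word is taken from the page title in the question.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class CharsetMismatchDemo {
    // Encodes text with one charset and decodes the resulting bytes with another.
    static String roundTrip(String text, Charset encodeWith, Charset decodeWith) {
        byte[] bytes = text.getBytes(encodeWith);
        return new String(bytes, decodeWith);
    }

    public static void main(String[] args) {
        Charset cp1252 = Charset.forName("windows-1252");
        String original = "Constitui\u00e7\u00e3o"; // "Constituição"

        // Same charset on both sides: the text survives unchanged.
        System.out.println(roundTrip(original, cp1252, cp1252).equals(original));

        // windows-1252 bytes decoded as UTF-8: the bytes for the accented
        // letters are not valid UTF-8, so they become U+FFFD replacement characters.
        String garbled = roundTrip(original, cp1252, StandardCharsets.UTF_8);
        System.out.println(garbled.contains("\uFFFD"));
    }
}
```

Plain ASCII letters come through either way, which would be consistent with the output in the question, where only the letters with diacritics are missing.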

To tell you the truth, it has never been an issue for me, because when reading a web page (as opposed to writing one), Java has decoded the text as UTF-8 without any configuration whatsoever. (An InputStreamReader created without a charset argument uses the platform default charset, which is UTF-8 on Android.) Here is the "Open Connection" method body from a library I have written:

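// USE_USER_AGENT and USER_AGENT are fields of the library class.
HttpURLConnection con = (HttpURLConnection) url.openConnection();
con.setRequestMethod("GET");
if (USE_USER_AGENT) con.setRequestProperty("User-Agent", USER_AGENT);
// No charset is passed here, so the reader decodes with the platform default charset.
return new BufferedReader(new InputStreamReader(con.getInputStream()));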
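If the default charset turns out not to match the page, the reader can be given an explicit one. The following is a sketch only, not part of the library above: `ExplicitCharsetReader`, `charsetFromContentType`, and `readAll` are hypothetical names, and the header value is assumed to look like `text/html; charset=windows-1252`.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.util.Locale;

class ExplicitCharsetReader {
    // Extracts the charset parameter from a Content-Type header value.
    // Returns the fallback when the header is null, has no charset
    // parameter, or names a charset this JVM does not support.
    static Charset charsetFromContentType(String contentType, Charset fallback) {
        if (contentType == null) return fallback;
        for (String part : contentType.split(";")) {
            String p = part.trim();
            if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String name = p.substring("charset=".length()).trim().replace("\"", "");
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    return fallback; // illegal or unsupported charset name
                }
            }
        }
        return fallback;
    }

    // Reads the whole stream with the given charset, keeping line breaks as '\n'.
    static String readAll(InputStream in, Charset charset) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (BufferedReader br = new BufferedReader(new InputStreamReader(in, charset))) {
            String line;
            while ((line = br.readLine()) != null) sb.append(line).append('\n');
        }
        return sb.toString();
    }
}
```

With an `HttpURLConnection`, the header value would come from `con.getContentType()`. A server may also omit the charset from the header and declare it only in a `<meta>` tag, in which case this sketch simply returns the fallback.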

Here is the method body of a "Scrape Contents" method from my library:

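URL url = new URL("http://www.planalto.gov.br/ccivil_03/constituicao/constituicao.htm");
StringBuilder sb = new StringBuilder();
String s;
// Scrape.openConn returns a BufferedReader over the response body.
BufferedReader br = Scrape.openConn(url);
// readLine() strips the line terminator, so add it back to keep the line breaks.
while ((s = br.readLine()) != null) sb.append(s).append("\n");
br.close();
// FileRW.writeFile is a helper from the same library.
FileRW.writeFile(sb.toString(), "page.html");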
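`FileRW.writeFile` belongs to the library and is not shown here. As a stand-in, here is a minimal sketch using only `java.io`, with the charset passed explicitly; `TextFileWriter` is a hypothetical name. It also closes the writer, which the code in the question never does, so buffered output there may not reach the file.

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.Charset;

class TextFileWriter {
    // Writes text to the target file in the given charset, overwriting any
    // existing content. try-with-resources flushes and closes the writer
    // even if write() throws.
    static void writeFile(String text, File target, Charset charset) throws IOException {
        try (Writer out = new OutputStreamWriter(new FileOutputStream(target, false), charset)) {
            out.write(text);
        }
    }
}
```

A call such as `TextFileWriter.writeFile(sb.toString(), new File("page.html"), StandardCharsets.UTF_8)` would then play the role of the last line above, under the assumption that the file should be UTF-8.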

I don't know the first thing about Microsoft character sets, to be fully honest with you. I have coded on UNIX, and I have never worried about character sets, other than making sure that when writing HTML (as opposed to reading HTML), an HTML <META CHARSET="utf-8"> element is inserted into my pages.

Technical posts on this site are licensed under CC BY-SA 4.0. If you reprint, please cite the site URL or the original address. For any questions, please contact: yoyou2525@163.com.

 