I'm developing an app to get legislation online and automatically parse and format it to fit the app. The test site I'm using is
http://www.planalto.gov.br/ccivil_03/constituicao/constituicao.htm
I want to grab all the contents of that URL, parse (maybe clean) them, and put them in a file. I'm using Jsoup; this is the Runnable I use to connect and write the content to a file:
class FetchHtmlRunnable implements Runnable {
    String url;
    FetchHtmlRunnable(String url) {
        this.url = url;
    }
    @Override
    public void run() {
        try {
            Document doc = Jsoup.parse(new URL(url), 10000);
            doc.charset(Charset.forName("windows-1252"));
            Charset charset = doc.charset();
            String htmlString = Jsoup.clean(doc.toString(), new Whitelist());
            Log.d(TAG, "run: HTMLSTRING: " + htmlString);
            String root = context.getFilesDir().toString();
            file = new File(root + File.separator + "law.txt");
            OutputStreamWriter out = new OutputStreamWriter(new FileOutputStream(file, false), charset);
            out.write(htmlString);
        } catch (IOException ex) {
            ex.printStackTrace();
        }
    }
}
However, even though Chrome tells me the site's encoding is windows-1252, both the log entry and the file are not only filled with replacement characters (all characters with diacritics, such as í and ã, are lost), they also lose all newlines:
Constitui o Presid ncia da Rep blica Casa Civil Subchefia para Assuntos Jur dicos CONSTITUI O DA REP BLICA FEDERATIVA DO BRASIL DE 1988 Vide Emenda Constitucional n 91, de 2016 Vide Emenda Constitucional n 106, de 2020 Vide Emenda Constitucional n 107, de 2020 Emendas Constitucionais Emendas Constitucionais de Revis o Ato das Disposi es Constitucionais Transit rias Atos decorrentes do disposto no 3 do art. 5 NDICE TEM TICO Texto compilado PRE MBULO N s, representantes do povo brasileiro, reunidos em Assembl ia Nacional Constituinte para instituir um Estado Democr tico, destinado a assegurar o exerc cio dos direitos sociais e individuais, a liberdade, a seguran a, o bem-estar, o desenvolvimento, a igualdade ea justi a como valores supremos de uma sociedade fraterna, pluralista e sem preconceitos, fundada na harmonia social e comprometida
Maybe someone better at web dev can tell me if that's a problem with the webpage itself and how I can work around it, and how I can keep the newline characters.
I will get to character sets in Portuguese, Spanish (and Chinese) in a moment. First, though, let me say that the page you are trying to read actually loads its contents using AJAX / JavaScript. I can download AJAX-driven pages using my own library, available on the Internet; otherwise a tool like Selenium, Puppeteer, or Splash would be necessary. Leaving character sets aside, how are you downloading the contents of your "Brazilian Constitution" as HTML in the first place? When I try a straight HTML downloader (no script execution), I get a pile of JavaScript without any Portuguese at all, and it looks nothing like the HTML posted in your question. :)
If you are already downloading the HTML and only have a problem with the character set, read the rest of this answer. If you have been unable to download anything but the AJAX / JavaScript calls, I can post a separate answer that explains how to execute the JS / AJAX in one or two lines. (Essentially, what you posted isn't the same output that I'm getting.)
In my experience, in the vast majority of cases, if a page is not straight-up ASCII (because it has foreign-language characters), then it is almost guaranteed to be readable as UTF-8. I translate Spanish and Chinese news articles, and UTF-8 always works for me. I had one Spanish site that expected an encoding called "iso8859-1", but other than that "Don Quijote de La Mancha" site, UTF-8 works.
To tell you the truth, it has never been an issue for me, because when reading a web page (as opposed to writing one), Java has decoded the text as UTF-8 without any configuration on my part. (An InputStreamReader created without a charset argument uses the platform default charset, which is UTF-8 on Android and, since Java 18, on every platform.) Here is the "Open Connection" method body from a library I have written:
HttpURLConnection con = (HttpURLConnection) url.openConnection();
con.setRequestMethod ("GET");
if (USE_USER_AGENT) con.setRequestProperty ("User-Agent", USER_AGENT);
return new BufferedReader (new InputStreamReader(con.getInputStream()));
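If the bytes on the wire are not in the charset the reader assumes, accented characters come out as replacement characters, which matches the symptom in the question. One option is to pass the charset explicitly. The following is only a minimal sketch: the names `ExplicitCharsetReader`, `open`, and `readAll` are hypothetical and not part of the library quoted above, and the demo uses windows-1252 purely because the question reports that encoding.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration; not part of the library quoted above.
final class ExplicitCharsetReader {

    // Wrap a stream in a reader that decodes with the given charset
    // instead of the platform default.
    static BufferedReader open(InputStream in, Charset charset) {
        return new BufferedReader(new InputStreamReader(in, charset));
    }

    // Read every line, re-adding the '\n' that readLine() strips.
    static String readAll(InputStream in, Charset charset) {
        StringBuilder sb = new StringBuilder();
        try (BufferedReader br = open(in, charset)) {
            String line;
            while ((line = br.readLine()) != null) {
                sb.append(line).append('\n');
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // "exercício" encoded as single bytes; í is the lone byte 0xED.
        // ISO-8859-1 and windows-1252 agree on this byte.
        byte[] raw = "exerc\u00EDcio".getBytes(StandardCharsets.ISO_8859_1);

        // Decoding with the charset the bytes were written in keeps the í.
        String good = readAll(new ByteArrayInputStream(raw), Charset.forName("windows-1252"));
        // Decoding the same bytes as UTF-8 yields U+FFFD in place of the í.
        String bad = readAll(new ByteArrayInputStream(raw), StandardCharsets.UTF_8);

        System.out.println("explicit charset keeps accents: " + good.contains("\u00ED"));
        System.out.println("UTF-8 produced a replacement char: " + bad.contains("\uFFFD"));
    }
}
```

With a live connection, the same `open` call would take `con.getInputStream()` as its first argument.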
Here is the method body of a "Scrape Contents" method from my library:
URL url = new URL("http://www.planalto.gov.br/ccivil_03/constituicao/constituicao.htm");
StringBuilder sb = new StringBuilder();
String s;
BufferedReader br = Scrape.openConn(url);
while ((s = br.readLine()) != null) sb.append(s + "\n");
FileRW.writeFile(sb.toString(), "page.html");
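`Scrape.openConn` and `FileRW.writeFile` belong to the answerer's own library, so the snippet above will not compile without it. As a rough equivalent using only the JDK, here is a sketch under stated assumptions: `PageSaver`, `copyLines`, and `download` are hypothetical names, the page charset is passed in by the caller, and the output file is written as UTF-8.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical JDK-only stand-in for the Scrape / FileRW calls above.
final class PageSaver {

    // Copy text line by line, writing '\n' after each line because
    // readLine() drops the line terminator. Returns the number of lines.
    static int copyLines(Reader in, Writer out) {
        try {
            BufferedReader br = new BufferedReader(in);
            int count = 0;
            String line;
            while ((line = br.readLine()) != null) {
                out.write(line);
                out.write('\n');
                count++;
            }
            out.flush();
            return count;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Fetch the raw HTML (no script execution), decode it with pageCharset,
    // and save it as UTF-8. try-with-resources closes both ends, so the
    // buffered output is actually flushed to disk.
    static int download(String address, Charset pageCharset, Path target) throws IOException {
        HttpURLConnection con = (HttpURLConnection) new URL(address).openConnection();
        con.setRequestMethod("GET");
        try (Reader in = new InputStreamReader(con.getInputStream(), pageCharset);
             BufferedWriter out = Files.newBufferedWriter(target, StandardCharsets.UTF_8)) {
            return copyLines(in, out);
        } finally {
            con.disconnect();
        }
    }

    public static void main(String[] args) throws IOException {
        // The charset here is an assumption taken from the question.
        download("http://www.planalto.gov.br/ccivil_03/constituicao/constituicao.htm",
                 Charset.forName("windows-1252"), Paths.get("page.html"));
    }
}
```

Keeping `copyLines` separate from the network code makes it possible to check the newline handling with an in-memory `StringReader` and `StringWriter`.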
I don't know the first thing about Microsoft character sets, to be fully honest with you. I have coded on UNIX, and I have never worried about character sets, other than making sure that when writing HTML (as opposed to reading it), an HTML <META CHARSET="utf-8"> element is inserted into my pages.
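For the writing side, the declared charset only helps if the bytes are really encoded that way. A small sketch of keeping the two in step (the class `HtmlPage` and its methods are hypothetical, for illustration only):

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper showing a declared charset that matches the encoding.
final class HtmlPage {

    // Build a minimal HTML document that declares UTF-8 in its head.
    // title and bodyHtml are inserted as-is (no escaping in this sketch).
    static String wrap(String title, String bodyHtml) {
        return "<!DOCTYPE html>\n"
             + "<html>\n<head>\n"
             + "<meta charset=\"utf-8\">\n"
             + "<title>" + title + "</title>\n"
             + "</head>\n<body>\n"
             + bodyHtml + "\n"
             + "</body>\n</html>\n";
    }

    // Encode with the same charset the meta element declares.
    static byte[] toBytes(String html) {
        return html.getBytes(StandardCharsets.UTF_8);
    }
}
```

The bytes returned by `toBytes` can then be written to disk unchanged, for example with `java.nio.file.Files.write`.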