
Removing paragraph tags in Java

I've got a Java program that downloads some files from the internet using HTMLUnit.

I'm trying to format these files into a CSV/Excel sheet.

My issue is that I can't seem to get the formatting quite right: one datum is sorted into a header rather than an instance.

I can tell in Microsoft Word that the paragraph symbol is the issue. However, I'm not sure what this translates to in Java. It isn't \n for newline.

What does the paragraph symbol (ASCII: ALT-244) translate to in Java? How can I remove or add this symbol for proper formatting?

P.S. trim() isn't doing it.

Thank you.

Unicode code point 244 is ô. The ASCII table only goes from 0 to 127.

Therefore, this statement:

What does the paragraph symbol (ASCII: ALT-244)

indicates you are confused. There is no ASCII 244. If you meant Unicode: 244 is not the paragraph symbol. If you meant ISO-8859-1: that one also has ô in the 244 spot, as does Cp1252. MacRoman has Ù. I'm starting to run out of commonly used encodings, so whatever you have is a weirder variant I'm not aware of.

IBM852 and IBM850 both have §, but at 245, so presumably you've miscopied or misunderstood something. My guess is IBM852, because that is too much of a coincidence.

Note that these IBM code pages are from the DOS era. Are you talking to a machine that's in a museum, perhaps? That's one heck of a dated charset!
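If you'd rather check these tables yourself than trust my memory, here is a minimal sketch. The class name ByteProbe and the helper decodeByte are made up for illustration. It decodes byte values 244 and 245 under a few charsets and prints the resulting Unicode code point, which avoids any dependence on how your console renders the glyph. Only a handful of charsets (US-ASCII, ISO-8859-1, UTF-8, UTF-16) are guaranteed on every Java runtime, so the sketch skips any name it cannot find.

```java
import java.nio.charset.Charset;

class ByteProbe {
    /** Decodes a single byte value (0-255) using the named charset. */
    static String decodeByte(int value, String charsetName) {
        byte[] one = { (byte) value };
        return new String(one, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        String[] names = { "ISO-8859-1", "windows-1252", "IBM850", "IBM852" };
        int[] values = { 244, 245 };
        for (String name : names) {
            if (!Charset.isSupported(name)) {
                // Extended charsets may be missing from trimmed-down runtimes.
                System.out.println(name + ": not available in this runtime");
                continue;
            }
            for (int value : values) {
                String decoded = decodeByte(value, name);
                // Print the code point rather than the glyph itself.
                System.out.printf("%-12s byte %d -> U+%04X%n",
                        name, value, (int) decoded.charAt(0));
            }
        }
    }
}
```

Look the printed code points up in a Unicode chart: U+00A7 is the section sign § and U+00B6 is the pilcrow ¶. That is a quick way to see which code page actually puts a paragraph-like symbol where you think it is.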

What you have is text, encoded as bytes. That's normal; computers do bytes. However, any time you take your bytes and interpret them as chars, you have to tell the computer how. If you don't, the computer will guess, and you may safely assume that it is going to guess wrong in exactly the way that screws you over: it'll work on your machine and pass your tests, and then fail in production. The solution is to never let the computer guess. Whenever bytes are turned into chars or vice versa, always specify a charset, or ensure that the method you use is explicitly documented to pick a fixed, defined charset.

For example, if you have the response from the webserver in a byte array and you're turning it into a string, then:

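import java.nio.charset.Charset;

byte[] data = htmlResponse.getAllData();
String guessed = new String(data); // Don't ever call this constructor: it falls back to the default charset.
String html = new String(data, Charset.forName("IBM852")); // Correct: the charset is explicit.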

If you're fetching this stuff over HTTP, the charset will be in the response header (Content-Type). HTMLUnit should be taking care of this for you. Apparently it is not, which suggests the web server in question is bugged and is sending the wrong charset. That, or your code is bugged and you've messed up the encoding conversion yourself.
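To make the header part concrete, here is a rough sketch of pulling the charset out of a Content-Type value by hand. This is not HTMLUnit's API; the class ContentTypeCharset and the method charsetOf are hypothetical names, and the fallback charset is whatever you decide the server really uses. It is only meant for the case where you end up with the raw header string and raw bytes yourself.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

class ContentTypeCharset {
    /**
     * Returns the charset named in a Content-Type header value such as
     * "text/html; charset=UTF-8". Falls back to the given charset when the
     * parameter is missing, malformed, or not supported by this runtime.
     */
    static Charset charsetOf(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        for (String part : contentType.split(";")) {
            String param = part.trim();
            if (param.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String name = param.substring("charset=".length()).trim();
                // The value may be quoted: charset="utf-8"
                if (name.length() >= 2 && name.startsWith("\"") && name.endsWith("\"")) {
                    name = name.substring(1, name.length() - 1);
                }
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    // Covers both illegal and unsupported charset names.
                    return fallback;
                }
            }
        }
        return fallback;
    }

    public static void main(String[] args) {
        Charset cs = charsetOf("text/html; charset=ISO-8859-1", StandardCharsets.UTF_8);
        System.out.println(cs.name());
    }
}
```

The explicit fallback argument is the point of the design: even when the server says nothing useful, the caller names a charset instead of letting the platform guess.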

It's just a symbol; strings in Java are more or less Unicode. Once you've read it in properly (and I bet you haven't, that's part of the problem here), removing it is trivial:

import java.nio.charset.Charset;

byte[] data = htmlResponse.getAllData();
String html = new String(data, Charset.forName("IBM852"));

// replace the § symbol with nothing (i.e. remove it)
html = html.replace("§", "");
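As for why trim() isn't doing it: trim() only strips leading and trailing characters whose code is U+0020 or lower, and neither § (U+00A7) nor ¶ (U+00B6) qualifies. Here is a small sketch of a cleanup helper; FieldCleaner and clean are names I made up, and removing the pilcrow as well as the section sign is my assumption about what you might run into. It uses \u escapes rather than literal symbols so the result does not depend on the encoding of the .java source file.

```java
class FieldCleaner {
    /** Removes section and pilcrow signs, then trims ordinary whitespace. */
    static String clean(String field) {
        return field
                .replace("\u00A7", "") // SECTION SIGN
                .replace("\u00B6", "") // PILCROW SIGN (assumed to be unwanted too)
                .trim();
    }

    public static void main(String[] args) {
        String raw = " 42\u00A7";
        // trim() drops the leading space but leaves the trailing sign alone.
        System.out.println(raw.trim().length());  // prints 3
        System.out.println(clean(raw).length());  // prints 2
    }
}
```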

NB: The easy way to do charsets is to encode everything in UTF-8, everywhere. Especially on the web. If you can talk to the owner of that website or whoever wrote the software for it, you should tell them to do that.
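The same rule applies on your side when you write the CSV out. A minimal sketch, assuming you already have the rows as strings (Utf8CsvWriter and its methods are hypothetical names, and no CSV quoting or escaping is attempted here):

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

class Utf8CsvWriter {
    /** Writes each row as one line, always encoded as UTF-8. */
    static void write(Path target, List<String> rows) throws IOException {
        try (BufferedWriter out = Files.newBufferedWriter(target, StandardCharsets.UTF_8)) {
            for (String row : rows) {
                out.write(row);
                out.write("\r\n"); // CRLF, the line ending RFC 4180 uses for CSV
            }
        }
    }

    /** Reads the rows back, again naming the charset explicitly. */
    static List<String> read(Path source) throws IOException {
        return Files.readAllLines(source, StandardCharsets.UTF_8);
    }
}
```

Both directions name the charset, so the file reads back the same on any machine. Whether Excel then detects UTF-8 on its own is a separate matter that this sketch does not address.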
