Converting Unicode data symbols to Strings

I have been unable to fix a problem with Java Unicode and encoding. The problem is that I have 5,000+ Strings like: "Steve O#8217Conord and Mirco Savas" and ..."Rusell O&#146Connell" where "#8217" and "&#146" must be replaced with an apostrophe.

But there are many different variants of this code, and the Apache Commons Lang library (apache.common.lang.StringUtils) isn't helping me solve the problem, because it needs the "&" character at the beginning and ";" at the end of each entity. I can't add these by hand because there are over 5,000 strings. So if there is any way, using regex or something else, to find these sequences in the strings and replace them with apostrophes, I'll be glad to hear it :)

Additionally, there are some symbols like "O’" and they are a big problem because they should be read in UTF8. I mean like (\脧) and other characters. Do you have any suggestions?

Try something like this:

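import java.io.PrintWriter;
import org.apache.commons.lang3.StringEscapeUtils;

public class UnescapeEntities {
    public static void main(String[] args) throws Exception {
        String[] myStringPool = {"Steve O#8217Conord and Mirco Savas", "Rusell O&#146Connell"};
        // try-with-resources closes the writer even if an exception is thrown
        try (PrintWriter pw = new PrintWriter("utf-8.txt", "UTF-8")) {
            for (String string : myStringPool) {
                // Normalize "#8217" and "&#146" to "&#8217;" and "&#146;", then unescape
                pw.println(StringEscapeUtils.unescapeXml(string.replaceAll("&?#(\\d+);?", "&#$1;")));
            }
        }
    }
}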

Assuming you already have these strings accessible, string.replaceAll("&?#(\\d+);?", "&#$1;") normalizes the malformed entities into well-formed XML numeric character references ("&#8217;", "&#146;"), which org.apache.commons.lang3.StringEscapeUtils (part of Apache Commons Lang 3) can then unescape. The strings are finally written to a file in UTF-8 format.
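If pulling in Commons Lang is not an option, the same idea can be sketched with the JDK alone. The class and method names below (EntityCleaner, clean) are made up for illustration. The sketch also makes one assumption that goes beyond the answer above: numbers in the range 128 to 159, such as 146, are treated as Windows-1252 byte values, because as Unicode code points they are control characters rather than an apostrophe.

```java
import java.nio.charset.Charset;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class EntityCleaner {
    // Optional '&', then '#', 1 to 7 digits, optional ';' (covers "#8217", "&#146", "&#8217;")
    private static final Pattern NUMERIC_REF = Pattern.compile("&?#(\\d{1,7});?");
    // Assumption: the running JDK ships the windows-1252 charset (mainstream JDKs do)
    private static final Charset CP1252 = Charset.forName("windows-1252");

    /** Replaces each loose numeric reference with the character it stands for. */
    public static String clean(String input) {
        Matcher m = NUMERIC_REF.matcher(input);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            int code = Integer.parseInt(m.group(1));
            String replacement;
            if (code >= 128 && code <= 159) {
                // Assumption: this range holds Windows-1252 bytes (146 is the right single quote)
                replacement = new String(new byte[] {(byte) code}, CP1252);
            } else if (Character.isValidCodePoint(code)) {
                replacement = new String(Character.toChars(code));
            } else {
                // Not a valid code point: leave the matched text untouched
                replacement = m.group();
            }
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(clean("Steve O#8217Conord and Mirco Savas"));
        System.out.println(clean("Rusell O&#146Connell"));
    }
}
```

Because the "&" and ";" are both optional, the pattern also matches any bare "#" followed by digits (for example "Apt #12"), so it is worth checking a sample of the 5,000 strings before running it over all of them.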

Note that Java can be configured to read and write files as UTF-8 by default; otherwise it uses the platform's default charset, which should be the right encoding for your system. It is generally a bad idea to force a specific encoding when writing files, unless you really know what you are doing.
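For the "O’" symptom mentioned in the question, one possible explanation is that UTF-8 bytes were decoded as Windows-1252 somewhere upstream. If that assumption holds, a minimal sketch of the repair is to turn the text back into Windows-1252 bytes and decode those bytes as UTF-8. The class and method names (MojibakeRepair, repair) are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class MojibakeRepair {
    // Assumption: the text was wrongly decoded as windows-1252 instead of UTF-8
    private static final Charset CP1252 = Charset.forName("windows-1252");

    /** Re-encodes the text as Windows-1252 bytes and decodes those bytes as UTF-8. */
    public static String repair(String garbled) {
        byte[] originalBytes = garbled.getBytes(CP1252);
        return new String(originalBytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "O’Connell" written with escapes so the source file encoding does not matter
        String garbled = "O\u00E2\u20AC\u2122Connell";
        System.out.println(repair(garbled));
    }
}
```

This should only be applied to strings known to be garbled in this way: running it on text that is already correct can corrupt legitimate non-ASCII characters.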
