Converting Unicode data symbols to Strings

I have been unable to fix a problem with Java, Unicode, and encoding. I have 5,000+ Strings like "Steve O#8217Conord and Mirco Savas" and "Rusell O&#146Connell", where "#8217" and "&#146" must be replaced with an apostrophe.

There are many different variants of these codes, and the Apache library (apache.common.lang.StringUtils) isn't helping me, because it needs the "&" character at the beginning of each code and ";" at the end. I can't add these by hand everywhere because there are over 5,000 strings. So if there is any way, using regex or something else, to find these sequences in the strings and replace them with apostrophes, I'll be glad to hear it :)

Additionally, there are some symbols like "O’", and they are a big problem because they should be read as UTF-8. I mean ones like (\u8127) and other characters. Do you have any suggestions?

Try something like this:

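import java.io.PrintWriter;
import org.apache.commons.lang3.StringEscapeUtils;

// The class name is arbitrary; it only makes the snippet compilable.
public class UnescapeEntities {
    public static void main(String[] args) throws Exception {
        String[] myStringPool = {"Steve O#8217Conord and Mirco Savas", "Rusell O&#146Connell"};
        // try-with-resources closes the writer even if an exception is thrown
        try (PrintWriter pw = new PrintWriter("utf-8.txt", "UTF-8")) {
            for (String string : myStringPool) {
                // Normalize "#8217", "&#146", etc. to the form "&#8217;" before unescaping
                pw.println(StringEscapeUtils.unescapeXml(string.replaceAll("&?#(\\d+);?", "&#$1;")));
            }
        }
    }
}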

Assuming you already have these strings accessible, string.replaceAll("&?#(\\d+);?", "&#$1;") rewrites the incomplete entity-like sequences as well-formed numeric character references (for example "&#8217;"), so that org.apache.commons.lang3.StringEscapeUtils (part of Apache Commons Lang 3) can unescape them. The strings are finally written to a file in UTF-8 format.
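The same idea can be sketched with only the JDK, which also leaves room to special-case "&#146". The number 146 is not an apostrophe in Unicode (it falls in the C1 control range), so a strict unescaper would most likely turn it into the control character U+0092. The sketch below assumes that values 128 to 159 were meant as Windows-1252 bytes, where 146 is the right single quotation mark. The class name NumericEntityCleaner, the method clean, and the Windows-1252 guess are all illustrative assumptions, not something taken from the answer above.

```java
import java.nio.charset.Charset;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NumericEntityCleaner {
    // Optional '&', then '#', then digits, then optional ';' (covers "#8217", "&#146", "&#8217;")
    private static final Pattern ENTITY = Pattern.compile("&?#(\\d{1,7});?");
    private static final Charset CP1252 = Charset.forName("windows-1252");

    static String clean(String input) {
        Matcher m = ENTITY.matcher(input);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            int code = Integer.parseInt(m.group(1));
            String replacement;
            if (code >= 128 && code <= 159) {
                // Assumption: these values are Windows-1252 bytes (146 -> U+2019), not C1 controls
                replacement = new String(new byte[] {(byte) code}, CP1252);
            } else if (Character.isValidCodePoint(code)) {
                replacement = new String(Character.toChars(code));
            } else {
                replacement = m.group(); // out of range: leave the text untouched
            }
            // quoteReplacement keeps '$' and '\' in the replacement literal
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(clean("Steve O#8217Conord and Mirco Savas"));
        System.out.println(clean("Rusell O&#146Connell"));
    }
}
```

Both sample strings should come out with a typographic apostrophe (U+2019). Because the pattern treats any "#" followed by digits as a character code, text such as "Apt #12" would also be rewritten, so it is worth checking a sample of the 5,000 strings before applying it to all of them.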

Note that Java can be configured to read and write files as UTF-8 automatically (for example, through the file.encoding system property). By default, Java should use the right encoding for your system. It is generally a bad idea to explicitly write files in a specific encoding unless you really know what you are doing.
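For the stray symbols mentioned in the question, one possible cause is UTF-8 bytes that were decoded with a single-byte charset at some point. If that is what happened (an assumption, since the question does not show the raw bytes), the text can sometimes be recovered by reversing the wrong decoding. The sketch below assumes the wrong charset was Windows-1252; the class name MojibakeRepair and the method repair are illustrative.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class MojibakeRepair {
    // Assumption: the text was wrongly decoded as Windows-1252
    private static final Charset CP1252 = Charset.forName("windows-1252");

    // Turn the wrongly decoded characters back into bytes, then decode those bytes as UTF-8
    static String repair(String garbled) {
        return new String(garbled.getBytes(CP1252), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // The three characters after 'O' are what the UTF-8 bytes E2 80 99 look like in Windows-1252;
        // they are written as escapes so the encoding of this source file does not matter
        String garbled = "O\u00E2\u20AC\u2122Connell";
        System.out.println(repair(garbled));

        // Shows which encoding this JVM uses when none is given explicitly
        System.out.println("Default charset: " + Charset.defaultCharset());
    }
}
```

This should only be applied to strings known to be garbled in this way: a string that is already correct and contains characters outside Windows-1252 would be damaged by the round trip.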


 