
Java string: convert a Unicode code point to a character

OK, so I feel like this question has been asked many times, but I am not able to find an answer. I am comparing two files that were generated by two different programs. Both programs generate the files from the same DB queries. I am running into the following difference:

s1 = Samsung - Mobile USB Chargers

vs.

s2 = Samsung \u2013 Mobile USB Chargers

How do I convert s2 to s1 or, even better, how do I compare the two without getting a difference? Someone somewhere on the internet suggested using the StringUtils class from Apache Commons Lang, but I couldn't find anything useful in it.

You could fold all the characters with the Dash_Punctuation property.

This code will print true:

boolean equal = "Samsung \u2013 Mobile USB Chargers"
                    .replaceAll("\\p{Pd}", "-")
                    .equals("Samsung - Mobile USB Chargers");
System.out.println(equal);

Note that this will apply to all characters with that property (like 〰 U+3030 WAVY DASH). A comprehensive list of the characters with the Dash_Punctuation (Pd) property is in UnicodeData.txt. Java 6 supports Unicode 4. See chapter 6 of the Unicode Standard for a discussion of punctuation.
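If the second file actually contains the escape as literal text (a backslash, a "u", and four hex digits) rather than the dash character itself, the escape has to be decoded before folding. The following is only a sketch under that assumption; the class and method names (DashFold, decodeEscapes, foldDashes, sameIgnoringDashes) are made up for illustration and are not part of any library.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DashFold {
    // Regex for a literal backslash, the letter u, and exactly four hex digits.
    private static final Pattern ESCAPE = Pattern.compile("\\\\u([0-9a-fA-F]{4})");

    // Assumption: escapes appear as plain text in the input; each one is
    // replaced by the UTF-16 code unit it names.
    static String decodeEscapes(String s) {
        Matcher m = ESCAPE.matcher(s);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            char c = (char) Integer.parseInt(m.group(1), 16);
            // quoteReplacement keeps '$' and '\' in the decoded text literal.
            m.appendReplacement(out, Matcher.quoteReplacement(String.valueOf(c)));
        }
        m.appendTail(out);
        return out.toString();
    }

    // Same folding as in the answer: every Dash_Punctuation character becomes '-'.
    static String foldDashes(String s) {
        return s.replaceAll("\\p{Pd}", "-");
    }

    static boolean sameIgnoringDashes(String a, String b) {
        return foldDashes(decodeEscapes(a)).equals(foldDashes(decodeEscapes(b)));
    }

    public static void main(String[] args) {
        String s1 = "Samsung - Mobile USB Chargers";
        // Doubled backslash: the string holds the escape as six plain characters.
        String s2 = "Samsung \\u2013 Mobile USB Chargers";
        System.out.println(sameIgnoringDashes(s1, s2)); // prints true
    }
}
```

Decoding first and folding second means the same comparison works whether the dash arrives as an escape or as the real character.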

The program that generated the first string is writing the file in ASCII, using a character substitution fallback mechanism. The second is writing the file in Unicode.

These could be compared by making a copy of the second file in ASCII using the same fallback mechanism.
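A rough sketch of such an ASCII copy is shown here. The fallback the first program really uses is not known, so this one is an assumption: dash punctuation becomes a hyphen, and anything else outside ASCII becomes "?", which is what String.getBytes(Charset) substitutes for characters US-ASCII cannot represent. The class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;

public class AsciiFallback {
    // Assumed fallback: dashes fold to '-', other non-ASCII characters become '?'.
    static String toAscii(String s) {
        String folded = s.replaceAll("\\p{Pd}", "-");
        // getBytes(Charset) replaces unmappable characters with the charset's
        // default replacement, which is '?' for US-ASCII.
        byte[] bytes = folded.getBytes(StandardCharsets.US_ASCII);
        return new String(bytes, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        String s1 = "Samsung - Mobile USB Chargers";
        String s2 = "Samsung \u2013 Mobile USB Chargers";
        System.out.println(toAscii(s2).equals(s1)); // prints true
    }
}
```

If the first program substitutes other characters too (for example curly quotes to straight quotes), the same replacements would have to be added before the encoding step.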

The best solution would be to modify the first program so that it also uses Unicode.

(It is possible that the second file was using something other than Unicode, since some other character sets include the en dash. If so, then the best solution is to write both files in Unicode, if possible.)
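If both programs happen to be written in Java, writing in Unicode mostly comes down to naming the charset explicitly instead of relying on the platform default. A minimal sketch, assuming UTF-8 is an acceptable encoding for both files; the class name, helper names, and the temporary file are illustrative only.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Collections;
import java.util.List;

public class Utf8RoundTrip {
    // Writes the lines as UTF-8 regardless of the platform default charset.
    static void writeLines(Path file, List<String> lines) throws IOException {
        Files.write(file, lines, StandardCharsets.UTF_8);
    }

    // Reads the lines back with the same explicit charset.
    static List<String> readLines(Path file) throws IOException {
        return Files.readAllLines(file, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("chargers", ".txt");
        try {
            String line = "Samsung \u2013 Mobile USB Chargers";
            writeLines(tmp, Collections.singletonList(line));
            // The en dash survives the round trip when both sides use UTF-8.
            System.out.println(readLines(tmp).get(0).equals(line)); // prints true
        } finally {
            Files.deleteIfExists(tmp);
        }
    }
}
```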
