
Java: Convert hex NCR text to Unicode characters

I'm making a feed reader app for local languages. A news site provides an RSS feed containing text like this:

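&#x0D39;&#x0D32;&#x0D4B; &#x0D38;&#x0D4D;&#x0D31;&#x0D4D;&#x0D31;&#x0D3E;&#x0D15;&#x0D4D;&#x0D15;&#x0D4D;&#x0D13;&#x0D35;&#x0D7C; &#x0D2B;&#x0D4D;&#x0D32;&#x0D4B;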

This actually represents the text ഹലോ സ്റ്റാക്ക്ഓവർ ഫ്ലോ, which is also what I want to display in my app.

How can I convert this input to the required form?

Try this.

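import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Input as delivered by the feed: hexadecimal numeric character references (NCRs).
String input = "&#x0D39;&#x0D32;&#x0D4B; &#x0D38;&#x0D4D;&#x0D31;"
    + "&#x0D4D;&#x0D31;&#x0D3E;&#x0D15;&#x0D4D;&#x0D15;&#x0D4D;&#x0D13;"
    + "&#x0D35;&#x0D7C; &#x0D2B;&#x0D4D;&#x0D32;&#x0D4B;";
// Group 1 captures the digits of a hex NCR (&#x...;), group 2 those of a decimal NCR (&#...;).
Pattern HEX = Pattern.compile("(?i)&#x([0-9a-f]+);|&#(\\d+);");
Matcher m = HEX.matcher(input);
StringBuffer sb = new StringBuffer();
while (m.find()) {
    int code = m.group(1) != null
        ? Integer.parseInt(m.group(1), 16)
        : Integer.parseInt(m.group(2));
    // quoteReplacement keeps a decoded '$' or '\' from being read as a group reference.
    m.appendReplacement(sb, Matcher.quoteReplacement(String.valueOf((char) code)));
}
m.appendTail(sb);
String output = sb.toString();
System.out.println(output);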
// -> ഹലോ സ്റ്റാക്ക്ഓവർ ഫ്ലോ

This code also handles decimal NCRs, but it cannot handle code points from U+10000 to U+10FFFF, because the (char) cast truncates each value to 16 bits.
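The supplementary-range limitation comes from the (char) cast, so one possible workaround is to append each value as a full code point instead. The following is only a sketch under that assumption: the class name NcrDecoder and the method decode are hypothetical names chosen for illustration, and leaving out-of-range references untouched is a design choice of this sketch, not something the feed format requires.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of any library.
final class NcrDecoder {
    // Group 1: digits of a hex NCR (&#x...;), group 2: digits of a decimal NCR (&#...;).
    private static final Pattern NCR =
        Pattern.compile("&#[xX]([0-9a-fA-F]+);|&#([0-9]+);");

    static String decode(String input) {
        Matcher m = NCR.matcher(input);
        StringBuilder sb = new StringBuilder();
        int last = 0;
        while (m.find()) {
            // Copy the literal text between the previous match and this one.
            sb.append(input, last, m.start());
            int codePoint;
            try {
                codePoint = m.group(1) != null
                    ? Integer.parseInt(m.group(1), 16)
                    : Integer.parseInt(m.group(2));
            } catch (NumberFormatException e) {
                codePoint = -1; // digits do not fit in an int
            }
            if (Character.isValidCodePoint(codePoint)) {
                // appendCodePoint emits a surrogate pair for values above U+FFFF.
                sb.appendCodePoint(codePoint);
            } else {
                // Assumption of this sketch: keep unparseable references as-is.
                sb.append(m.group());
            }
            last = m.end();
        }
        sb.append(input, last, input.length());
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(decode("&#x0D39;&#x0D32;&#x0D4B; &#x1F600; &#36;"));
    }
}
```

Because the decoded text is appended directly rather than passed through appendReplacement, characters such as '$' need no quoting here.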

Or you can use Jsoup like this.

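import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

// Jsoup decodes character references while parsing; text() returns the parsed text content.
Document doc = Jsoup.parse(input);
String output = doc.text();
System.out.println(output);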
// -> ഹലോ സ്റ്റാക്ക്ഓവർ ഫ്ലോ

