How to remove the Unicode decimal values of miscellaneous symbols from a string in Java

Question

I am trying to remove or replace the miscellaneous symbols in a string (in Java) that comes from a text area field in a web application. When I submit the content below, the symbols are converted into their Unicode decimal representations.

The content is: String a = "Last Search Results Bulletin Board Validations ⛔ 0 ⚡ 1 ⚠ 6? 0";

When I save that content in the text area (on a web page), the symbols are saved as &#9940;, &#9889;, &#9888;.

I want to either remove the Unicode representation values or save the content in the proper format, so that I have proper data to store in the database.

How do I remove the Unicode representations of the symbols ('&#9940;', '&#9889;' or '&#9888;') from the string? I tried a regular expression to replace them: s.replaceAll("&#[9728 - 9983];", " "). The range [9728 - 9983] is meant to cover the decimal values of the Miscellaneous Symbols block, but it does not replace them properly. Which regular expression can I use, or which other approach can remove these values from a string?

(or)

How do I convert the Unicode representations ('&#9940;', '&#9889;' or '&#9888;') back into the same symbols (⛔, ⚡, ⚠) in the string?

Answer 1

I haven't found such a utility in stock Java. You'll just have to do it the 'hard' way.

Note that this doesn't cover the hexadecimal equivalents (e.g. &#x26D4;) or decimal values whose length is not equal to 4.

import java.text.NumberFormat;
import java.text.ParsePosition;

public static String htmlCharsDecode(String string) {
    int           length = string.length();
    StringBuilder out    = new StringBuilder(length);

    NumberFormat  parser = NumberFormat.getInstance();
    ParsePosition pos    = new ParsePosition(0);

    for (int i = 0; i < length; i++) {
        char c = string.charAt(i);

        // Look for the fixed-width shape "&#dddd;" (7 characters) starting at i.
        if (c == '&' && i < length - 6 && string.charAt(i + 1) == '#' && string.charAt(i + 6) == ';') {
            String codepointString = string.substring(i + 2, i + 6);

            pos.setIndex(0);
            Number value = parser.parse(codepointString, pos);

            // Accept the entity only if all four characters were consumed as a number.
            boolean isDecimal = pos.getIndex() == codepointString.length();
            if (isDecimal) {
                int codepoint = value.intValue();
                // 9728..9983 is the Miscellaneous Symbols block; 9984..9999 reaches into Dingbats.
                if (codepoint >= 9728 && codepoint <= 9999) {
                    out.append((char)codepoint);
                    i += 6; // skip the rest of the entity; the loop's i++ steps past ';'
                    continue;
                }
            }
        }

        out.append(c);
    }

    return out.toString();
}

You can make parser and pos global to avoid creating new objects on each call, but watch out: they are not thread-safe (and premature optimization is not a good idea).
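The limitations noted above (no hexadecimal form, fixed length of 4) can be lifted with a regular expression that captures the number and a range check done in code. The following is a minimal sketch, not part of the original answer: the class name NumericEntities and the methods decode and strip are hypothetical, and it assumes only the Miscellaneous Symbols block (U+2600 to U+26FF, decimal 9728 to 9983) should be touched.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: decodes or removes numeric character references
// whose code point lies in the Miscellaneous Symbols block.
class NumericEntities {
    // Group 1: decimal digits (&#9940;). Group 2: hex digits (&#x26D4;).
    private static final Pattern ENTITY =
            Pattern.compile("&#(?:(\\d{1,7})|[xX]([0-9a-fA-F]{1,6}));");

    // Miscellaneous Symbols block: U+2600..U+26FF (decimal 9728..9983).
    private static final int FIRST = 0x2600;
    private static final int LAST  = 0x26FF;

    /** Replaces in-range entities with the symbol they stand for. */
    static String decode(String s) {
        return rewrite(s, true);
    }

    /** Deletes in-range entities. */
    static String strip(String s) {
        return rewrite(s, false);
    }

    private static String rewrite(String s, boolean keepSymbol) {
        Matcher m = ENTITY.matcher(s);
        StringBuilder out = new StringBuilder(s.length());
        int last = 0;
        while (m.find()) {
            int cp = m.group(1) != null
                    ? Integer.parseInt(m.group(1))
                    : Integer.parseInt(m.group(2), 16);
            out.append(s, last, m.start());   // text before the entity
            if (cp >= FIRST && cp <= LAST) {
                if (keepSymbol) {
                    out.appendCodePoint(cp);  // emit the symbol itself
                }                             // else: drop the entity
            } else {
                out.append(m.group());        // leave other entities untouched
            }
            last = m.end();
        }
        return out.append(s, last, s.length()).toString();
    }

    public static void main(String[] args) {
        String saved = "Validations &#9940; 0 &#x26A1; 1 &#9888; 6";
        System.out.println(strip(saved));
        System.out.println(decode(saved));
    }
}
```

The range check is done on the parsed integer because a character class such as [9728 - 9983] matches a single character, not a number between 9728 and 9983.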

Answer 2

You need to render the page in UTF-8 and declare on the form that the server accepts UTF-8 form data. (Otherwise &#...; entities are sent instead of Unicode symbols.)

<form action="..." accept-charset="UTF-8">

In HTML 5:

<meta charset="UTF-8">

Older HTML:

<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">

The response header should be set accordingly too:

Content-Type: text/html; charset=UTF-8

response.setContentType("text/html; charset=UTF-8");
response.setCharacterEncoding("UTF-8");
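To see why the form encoding matters, it can help to check whether the text fits the charset at all. The sketch below is only an illustration (the class CharsetCheck and method fits are hypothetical names): it uses the standard CharsetEncoder.canEncode to show that the symbols from the question cannot be represented in ISO-8859-1 but can be in UTF-8. When a form is submitted in an encoding that cannot represent a character, browsers typically fall back to a numeric reference such as &#9940;.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only.
class CharsetCheck {
    /** True if every character of text can be represented in the given charset. */
    static boolean fits(String text, Charset charset) {
        return charset.newEncoder().canEncode(text);
    }

    public static void main(String[] args) {
        // U+26D4, U+26A1 and U+26A0 are the symbols from the question.
        String text = "Validations \u26D4 0 \u26A1 1 \u26A0 6";
        System.out.println(fits(text, StandardCharsets.ISO_8859_1)); // false
        System.out.println(fits(text, StandardCharsets.UTF_8));      // true
    }
}
```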

Answer 3

There are a few libraries that can unescape HTML entities, e.g. the JSoup Parser.unescapeEntities() method.

If you simply want to remove emojis, take a look at this answer, which uses a white-list filter approach:

String input = "Last Validations ⛔ 0 ⚡ 1 ⚠ 6 ? 0";
String characterFilter = "[^\\p{L}\\p{M}\\p{N}\\p{P}\\p{Z}\\p{Cf}\\p{Cs}\\s]";
String emotionless = input.replaceAll(characterFilter,""); 
System.out.println(emotionless); // Last Validations  0  1  6 ? 0
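If only the range mentioned in the question should go, a narrower alternative to the white-list is to remove just the Miscellaneous Symbols block (U+2600 to U+26FF, decimal 9728 to 9983) once the string holds real characters rather than entities. This is a sketch under that assumption; the class SymbolFilter and method removeMiscSymbols are hypothetical names.

```java
// Hypothetical helper: strips characters in the Miscellaneous Symbols block.
class SymbolFilter {
    // The regex engine itself interprets the \\uXXXX escapes as a character range.
    private static final String MISC_SYMBOLS = "[\\u2600-\\u26FF]";

    static String removeMiscSymbols(String s) {
        return s.replaceAll(MISC_SYMBOLS, "");
    }

    public static void main(String[] args) {
        String input = "Last Validations \u26D4 0 \u26A1 1 \u26A0 6 ? 0";
        System.out.println(removeMiscSymbols(input)); // Last Validations  0  1  6 ? 0
    }
}
```

Unlike the white-list filter, this leaves symbols from other blocks (for example Dingbats) in place.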


Answer 1
1 2019-11-19 15:34:39

Answer 2
1 2019-11-19 16:02:13

Answer 3
0 2019-11-19 15:38:56
