
How to parse UTF-8 representation to String in Java?

Given the following code:

String tmp = new String("\\u0068\\u0065\\u006c\\u006c\\u006f\\u000a");

String result = convertToEffectiveString(tmp); // result now contains "hello\n"

Does the JDK already provide classes for doing this? Is there a library that does this (preferably one available through Maven)?

I have tried ByteArrayOutputStream with no success.

Firstly, are you just trying to parse a string literal, or is tmp going to be some user-entered data?

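If this is going to be a string literal (i.e. a hard-coded string), it can be encoded using Unicode escapes. In your case, this just means using single backslashes instead of double backslashes. The one exception is the final \u000a: the compiler translates Unicode escapes before it parses the literal, so that escape becomes a real line break inside the quotes and the code does not compile. Use \n for that character:

String result = "\u0068\u0065\u006c\u006c\u006f\n";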

If, however, you need to use Java's string parsing rules to parse user input, a good starting point might be Apache Commons Lang's StringEscapeUtils.unescapeJava() method (in newer releases the same class lives in Apache Commons Text).
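If pulling in a library is not an option, the \uXXXX part of the job can be sketched with java.util.regex alone. The class name UnicodeUnescaper and its unescape method are made up for illustration, and the sketch handles only \uXXXX sequences. It does not handle \n, \t, octal escapes, or an escaped backslash in front of a u:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the JDK or of Commons Lang.
final class UnicodeUnescaper {
    // A backslash, one or more 'u', then exactly four hex digits.
    private static final Pattern ESCAPE = Pattern.compile("\\\\u+([0-9a-fA-F]{4})");

    static String unescape(String input) {
        Matcher m = ESCAPE.matcher(input);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            // Each escape names one UTF-16 code unit.
            char unit = (char) Integer.parseInt(m.group(1), 16);
            // quoteReplacement stops a decoded '\' or '$' from being
            // interpreted as a group reference by appendReplacement.
            m.appendReplacement(out, Matcher.quoteReplacement(String.valueOf(unit)));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String tmp = "\\u0068\\u0065\\u006c\\u006c\\u006f\\u000a";
        System.out.print(unescape(tmp)); // prints hello followed by a line break
    }
}
```

Text that is not part of an escape is copied through unchanged, so the method can be applied to mixed input such as a line read from a file.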

The code below works, but only with ASCII. If you use Unicode characters outside the ASCII range, you will have problems, because each code point is cast to a single byte, whereas UTF-8 encodes everything above U+007F as two or more bytes. The typecast below is safe only because you have guaranteed that the input is basically ASCII (as you mention in your comments).

package sample;

import java.nio.charset.StandardCharsets;

public class UnicodeSample {
    public static final int HEXADECIMAL = 16;

    public static void main(String[] args) {
        String str = "\\u0068\\u0065\\u006c\\u006c\\u006f\\u000a";

        // Replace each escape prefix with a space, then split into hex strings.
        String[] arr = str.replaceAll("\\\\u", " ").trim().split(" ");
        byte[] utf8 = new byte[arr.length];

        int index = 0;
        for (String ch : arr) {
            // The cast keeps only the low 8 bits, which is why this is ASCII-only.
            utf8[index++] = (byte) Integer.parseInt(ch, HEXADECIMAL);
        }

        // StandardCharsets.UTF_8 avoids the checked UnsupportedEncodingException.
        String newStr = new String(utf8, StandardCharsets.UTF_8);
        System.out.println(newStr);
    }
}
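To make the ASCII-only limitation concrete, here is a small sketch that decodes the same value two ways: once through a byte cast and UTF-8 decoding, and once as a plain char. The class and method names are invented for this illustration:

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class comparing the two decoding routes.
final class ByteCastDemo {
    // Mimics the byte-array approach: cast to one byte, then decode as UTF-8.
    static String viaByteCast(int codeUnit) {
        byte[] bytes = { (byte) codeUnit };
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // Treats the value as a UTF-16 code unit instead.
    static String viaChar(int codeUnit) {
        return String.valueOf((char) codeUnit);
    }

    public static void main(String[] args) {
        int h = 0x0068;      // 'h', inside the ASCII range
        int eAcute = 0x00e9; // e with acute accent, outside the ASCII range

        // Both routes agree for ASCII.
        System.out.println(viaByteCast(h).equals(viaChar(h)));
        // A lone 0xE9 byte is not valid UTF-8, so the byte route
        // yields the replacement character and the routes disagree.
        System.out.println(viaByteCast(eAcute).equals(viaChar(eAcute)));
    }
}
```

The two routes agree only while every value stays at or below 0x7F, which is exactly the guarantee the answer relies on.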

Here is another solution that fixes the issue of only working with ASCII characters. It handles any \uXXXX escape rather than only values that fit in a single byte, because each escape is treated as one UTF-16 code unit and appended as a char instead of being pushed through a byte array and decoded as UTF-8. Thanks to deceze for the questions. You made me think more about the problem and solution.

package sample;

public class UnicodeSample {
    public static final int HEXADECIMAL = 16;

    public static void main(String[] args) {
        String str = "\\u0068\\u0065\\u006c\\u006c\\u006f\\u000a\\u3fff\\uf34c";

        // Replace each escape prefix with a space, then split into hex strings.
        String[] codes = str.replaceAll("\\\\u", " ").trim().split(" ");

        StringBuilder sb = new StringBuilder();
        for (String c : codes) {
            // Each escape is one UTF-16 code unit, so it maps directly to a char.
            // Splitting the value into raw bytes and decoding them as UTF-8
            // would produce garbage for anything above 0x7F.
            sb.append((char) Integer.parseInt(c, HEXADECIMAL));
        }

        str = sb.toString();
        System.out.println(str);
    }
}
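One case worth checking with any decoder of this kind is a character outside the Basic Multilingual Plane, which is written as two escapes (a surrogate pair). The sketch below assumes a hypothetical helper named decodeUnits that accepts input made up only of such escapes and appends each one as a char:

```java
// Hypothetical demo: a surrogate pair written as two escapes.
final class SurrogatePairDemo {
    // Decodes a string that consists only of backslash-u escapes.
    // Assumes well-formed, non-empty input; there is no error handling.
    static String decodeUnits(String escaped) {
        StringBuilder sb = new StringBuilder();
        for (String hex : escaped.replace("\\u", " ").trim().split(" ")) {
            sb.append((char) Integer.parseInt(hex, 16));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // U+1F600 is encoded in UTF-16 as the pair D83D DE00.
        String s = decodeUnits("\\ud83d\\ude00");
        System.out.println(s.length());                             // 2
        System.out.println(s.codePointCount(0, s.length()));        // 1
        System.out.println(Integer.toHexString(s.codePointAt(0)));  // 1f600
    }
}
```

Because a Java String is a sequence of UTF-16 code units, appending the two halves in order is enough for them to be read back as a single code point.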

I'm sure there must be a better way, but using just the JDK:

public static String handleEscapes(final String s)
{
    final java.util.Properties props = new java.util.Properties();
    props.setProperty("foo", s);
    final java.io.ByteArrayOutputStream baos = new java.io.ByteArrayOutputStream();
    try
    {
        props.store(baos, null);
        final String tmp = baos.toString().replace("\\\\", "\\");
        props.load(new java.io.StringReader(tmp));
    }
    catch(final java.io.IOException ioe) // shouldn't happen
        { throw new RuntimeException(ioe); }
    return props.getProperty("foo");
}

This uses java.util.Properties.load(java.io.Reader) to process the backslash escapes. Before that, it uses java.util.Properties.store(java.io.OutputStream, java.lang.String) to backslash-escape anything that would cause problems in a properties file, and then replace("\\\\", "\\") to reverse the backslash-escaping of the original backslashes.

(Disclaimer: even though I tested all the cases I could think of, there are probably still some that I didn't think of.)
