
How to convert a stream of bytes to UTF-8 characters?

I need to convert a stream of bytes to a line of UTF-8 characters. The only character in that line that matters to me is the last one. The conversion happens in a loop, so performance is very important. A simple and inefficient approach would be:

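import java.io.ByteArrayOutputStream;
import java.io.UnsupportedEncodingException;

public class Foo {
  // Line feed (0x0A), as stated in the comments on the question.
  private static final char THE_CHAR_WE_ARE_WAITING_FOR = '\n';
  private ByteArrayOutputStream buffer = new ByteArrayOutputStream();
  void next(byte input) throws UnsupportedEncodingException {
    this.buffer.write(input);
    // Decodes the whole buffer again on every byte; this is time consuming.
    String text = this.buffer.toString("UTF-8");
    if (text.charAt(text.length() - 1) == THE_CHAR_WE_ARE_WAITING_FOR) {
      System.out.println("hurray!");
      this.buffer.reset();
    }
  }
}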

The byte array is converted to a string on every input byte, which is, as far as I understand, very inefficient. Is there some other way to do this that preserves the results of the bytes-to-text conversion from the previous iteration?

You can use a simple class to keep track of the bytes and only convert once you have a complete UTF-8 sequence. Here's a sample (with no error checking, which you may want to add):

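import java.io.UnsupportedEncodingException;

class UTF8Processor {
    // Holds the bytes of the sequence currently being assembled.
    // Well-formed UTF-8 needs at most 4; 6 leaves room for the obsolete longer forms.
    private byte[] buffer = new byte[6];
    private int count = 0;

    // Returns the decoded character once its last byte arrives, otherwise null.
    public String processByte(byte nextByte) throws UnsupportedEncodingException {
        buffer[count++] = nextByte;
        if(count == expectedBytes())
        {
            String result = new String(buffer, 0, count, "UTF-8");
            count = 0;
            return result;
        }
        return null;
    }

    // Derives the sequence length from the lead byte (the first byte in the buffer).
    // No validation: a stray continuation byte (0x80-0xBF) is treated as a 2-byte lead.
    private int expectedBytes() {
        int num = buffer[0] & 255;
        if(num < 0x80) return 1;
        if(num < 0xe0) return 2;
        if(num < 0xf0) return 3;
        if(num < 0xf8) return 4;
        return 5;
    }
}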

class Bop
{
    public static void main (String[] args) throws java.lang.Exception
    {
        // Create test data.
        String str = "Hejsan åäö/漢ya";
        byte[] bytes = str.getBytes("UTF-8");

        String ch;

        // Process byte by byte; processByte returns a decoded character
        // as soon as a complete UTF-8 sequence has been seen.

        UTF8Processor processor = new UTF8Processor();

        for(int i=0; i<bytes.length; i++)
        {
            if((ch = processor.processByte(bytes[i])) != null)
                System.out.println(ch);
        }
    }
}
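The sample above skips error checking. As a rough sketch of what such a check might look like (the class `Utf8Lead` and its method names are made up for illustration, not part of the answer), the lead-byte test can reject bytes that can never start a well-formed UTF-8 sequence, and a second helper can verify continuation bytes:

```java
class Utf8Lead {
    // Returns the total length (1 to 4) of the UTF-8 sequence that starts
    // with the given lead byte, or -1 if the byte cannot start a sequence.
    static int sequenceLength(byte lead) {
        int b = lead & 0xFF;
        if (b < 0x80) return 1;   // 0xxxxxxx: ASCII
        if (b < 0xC2) return -1;  // continuation byte, or overlong lead (0xC0, 0xC1)
        if (b < 0xE0) return 2;   // 110xxxxx
        if (b < 0xF0) return 3;   // 1110xxxx
        if (b < 0xF5) return 4;   // 11110xxx, limited to code points up to U+10FFFF
        return -1;                // 0xF5..0xFF never appear in well-formed UTF-8
    }

    // True if b has the form 10xxxxxx, i.e. it may only follow a lead byte.
    static boolean isContinuation(byte b) {
        return (b & 0xC0) == 0x80;
    }
}
```

A processor like the one above could call `sequenceLength` on the first byte and `isContinuation` on each following byte, and reset its buffer when either check fails. This still does not catch every malformed sequence (for example, some overlong 3-byte forms), so treat it as a starting point.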

Based on the comment:

It's line feed (0x0A)

Your next method can just check:

if ((char)input == THE_CHAR_WE_ARE_WAITING_FOR) {
    //whatever your logic is.
}

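You don't have to do any conversion for characters < 128. In UTF-8 each of them is encoded as a single byte with the same value, and every byte of a multi-byte sequence is 128 or above, so a byte equal to 0x0A can only be a line feed.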
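A minimal sketch of that idea, assuming the data is valid UTF-8 (the class `LineFeedScan` and the method `countLineFeeds` are hypothetical names used only for this illustration). It compares raw bytes without decoding anything, even though the text contains multi-byte characters:

```java
import java.nio.charset.StandardCharsets;

class LineFeedScan {
    // Counts 0x0A bytes directly in UTF-8 encoded data.
    // Every byte of a multi-byte UTF-8 sequence is >= 0x80,
    // so a 0x0A byte is always a real line feed.
    static int countLineFeeds(byte[] utf8) {
        int count = 0;
        for (byte b : utf8) {
            if (b == 0x0A) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // "åäö", line feed, "漢", line feed, "ya", written with Unicode escapes.
        byte[] data = "\u00e5\u00e4\u00f6\n\u6f22\nya".getBytes(StandardCharsets.UTF_8);
        System.out.println(countLineFeeds(data)); // prints 2
    }
}
```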

You have two options:

  • If the code point you are interested in is a simple one in UTF-8 terms, that is, a code point below 128, then a plain cast from byte to char is possible. Look up the encoding rules on Wikipedia: UTF-8 for the reason why this works.

  • If this is not possible, take a look at the Charset class, which is the root of Java's encoding/decoding library. There you will find CharsetDecoder, which you can feed N bytes and get back M characters. In the general case N != M. However, you will have to deal with ByteBuffer and CharBuffer.
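For the second option, here is a rough sketch of feeding a CharsetDecoder one byte at a time. The class `IncrementalDecoder`, its `feed` method, and the buffer sizes are assumptions made for this illustration; only the `java.nio` calls are standard library API. The decoder leaves the bytes of an incomplete sequence in the input buffer, so nothing is decoded twice:

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class IncrementalDecoder {
    // Malformed input is replaced (with U+FFFD) instead of stopping the decoder.
    private final CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
            .onMalformedInput(CodingErrorAction.REPLACE)
            .onUnmappableCharacter(CodingErrorAction.REPLACE);
    // Holds the bytes of a sequence that is not complete yet.
    private final ByteBuffer in = ByteBuffer.allocate(8);
    // Receives the chars produced by a single call to feed.
    private final CharBuffer out = CharBuffer.allocate(8);

    // Feeds one byte; returns the decoded text, or "" if more bytes are needed.
    String feed(byte b) {
        in.put(b);
        in.flip();                       // switch the buffer to reading mode
        out.clear();
        decoder.decode(in, out, false);  // false: more input may follow
        in.compact();                    // keep undecoded bytes for the next call
        out.flip();
        return out.toString();
    }

    public static void main(String[] args) {
        IncrementalDecoder d = new IncrementalDecoder();
        byte[] data = "Hejsan \u00e5\u00e4\u00f6\n".getBytes(StandardCharsets.UTF_8);
        for (byte b : data) {
            if (d.feed(b).equals("\n")) {
                System.out.println("hurray!");
            }
        }
    }
}
```

Each call does a constant amount of work regardless of how many bytes came before, which is the property the question asks for.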

Wrap your byte-getting code in an InputStream and pass that to an InputStreamReader.

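    // Requires java.io.* and java.nio.charset.StandardCharsets.
    InputStreamReader isr = new InputStreamReader(new InputStream() {
        @Override
        public int read() throws IOException {
            return xx();// wherever you get your data from; return -1 at end of stream.
        }
    }, StandardCharsets.UTF_8);
    try {
        int c;
        // read() returns one decoded char, or -1 at end of stream.
        while((c = isr.read()) != -1) {
            if(c == THE_CHAR_WE_ARE_WAITING_FOR)
                System.out.println("hurray!");
        }
    } catch(IOException e) {
        e.printStackTrace();
    }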
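To try the reader approach without a live data source, the following sketch swaps in a ByteArrayInputStream (the class `ReaderDemo` and the method `countChar` are hypothetical names for this example). Keep in mind that, according to its Javadoc, InputStreamReader may read ahead more bytes from the underlying stream than the current read needs, which can matter if the source blocks:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

class ReaderDemo {
    // Counts how often target occurs among the chars decoded from UTF-8 bytes.
    static int countChar(byte[] utf8, char target) {
        int count = 0;
        try (Reader r = new InputStreamReader(
                new ByteArrayInputStream(utf8), StandardCharsets.UTF_8)) {
            int c;
            // read() returns one decoded char, or -1 at end of stream.
            while ((c = r.read()) != -1) {
                if (c == target) {
                    count++;
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return count;
    }

    public static void main(String[] args) {
        byte[] data = "a\u6f22b\u6f22\n".getBytes(StandardCharsets.UTF_8);
        System.out.println(countChar(data, '\u6f22')); // prints 2
    }
}
```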

