
How can I decode a large, multi-byte string file progressively in Java?

I have a program that may need to process large files, possibly containing multi-byte encodings. My current code for doing this has the problem that it creates an in-memory structure to hold the entire file, which can cause an out-of-memory error if the file is large:

Charset charset = Charset.forName( "UTF-8" );
CharsetDecoder decoder = charset.newDecoder();
FileInputStream fis = new FileInputStream( file );
FileChannel fc = fis.getChannel();
int lenFile = (int)fc.size();
MappedByteBuffer bufferFile = fc.map( FileChannel.MapMode.READ_ONLY, 0, lenFile );
CharBuffer cb = decoder.decode( bufferFile );
// process character buffer
fc.close();

The problem is that if I chop up the file's byte contents using a smaller buffer and feed it piecemeal to the decoder, the buffer could end in the middle of a multi-byte sequence. How should I cope with this problem?
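For reference, the CharsetDecoder API itself supports exactly this progressive use: calling decode() with endOfInput=false tells the decoder to retain an incomplete trailing multi-byte sequence in the input buffer until more bytes arrive. A minimal sketch (the decodeChunks helper and the hard-coded chunks, which split "é" across a buffer boundary, are illustrative only):

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.StandardCharsets;

public class ChunkedDecode {

    // Feed byte chunks to one decoder; endOfInput=false lets the decoder
    // hold back an incomplete trailing multi-byte sequence until the
    // next chunk arrives.
    static String decodeChunks(byte[][] chunks) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder();
        ByteBuffer in = ByteBuffer.allocate(64);
        CharBuffer out = CharBuffer.allocate(64);
        for (byte[] chunk : chunks) {
            in.put(chunk);
            in.flip();
            decoder.decode(in, out, false); // false = more input is coming
            in.compact();                   // carry leftover partial bytes over
        }
        in.flip();
        decoder.decode(in, out, true);      // true = no more input
        decoder.flush(out);
        out.flip();
        return out.toString();
    }

    public static void main(String[] args) {
        // "é" is 0xC3 0xA9 in UTF-8; splitting it across two chunks
        // simulates a read buffer ending mid-sequence.
        byte[][] chunks = {
            { 'a', 'b', (byte) 0xC3 },
            { (byte) 0xA9, 'c' }
        };
        System.out.println(decodeChunks(chunks)); // prints abéc
    }
}
```

In a real program the chunks would come from repeated reads on an InputStream or FileChannel into the same ByteBuffer.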

It is as easy as using a Reader.

A CharsetDecoder is indeed the underlying mechanism that allows the decoding of bytes into chars. In short, you could say that:

// Extrapolation...
byte stream --> decoding       --> char stream
InputStream --> CharsetDecoder --> Reader

The lesser-known fact is that most (but not all... see below) default decoders in the JDK (such as those created by a FileReader, for instance, or an InputStreamReader given only a charset) will have a policy of CodingErrorAction.REPLACE. The effect is to replace any invalid byte sequence in the input with the Unicode replacement character (yes, that infamous �).
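To illustrate that default (String's decoding constructor applies the same REPLACE policy), a byte that can never occur in valid UTF-8 comes out as U+FFFD rather than raising an error:

```java
import java.nio.charset.StandardCharsets;

public class ReplaceDemo {
    public static void main(String[] args) {
        // 0xFF is never valid in UTF-8; the default REPLACE policy
        // substitutes U+FFFD for it instead of failing.
        byte[] bad = { 'o', 'k', (byte) 0xFF };
        String s = new String(bad, StandardCharsets.UTF_8);
        System.out.println(s);                       // prints ok�
        System.out.println(s.charAt(2) == '\uFFFD'); // prints true
    }
}
```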

Now, if you are concerned about the ability of "bad characters" to slip in, you can instead select a policy of REPORT. You can do that when reading a file, too, as follows; this will have the effect of throwing a MalformedInputException on any malformed byte sequence:

// This is 2015. File is obsolete.
final Path path = Paths.get(...);
final CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
    .onMalformedInput(CodingErrorAction.REPORT);

try (
    final InputStream in = Files.newInputStream(path);
    final Reader reader = new InputStreamReader(in, decoder);
) {
    // use the reader
}

ONE EXCEPTION to that default REPLACE action appears in Java 8: Files.newBufferedReader(somePath) will always read in UTF-8, with a default action of REPORT.
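A small sketch of that Java 8 behavior (the temp-file setup is illustrative): writing one invalid byte and reading it back through Files.newBufferedReader should fail with MalformedInputException, no decoder configuration required:

```java
import java.io.IOException;
import java.io.Reader;
import java.nio.charset.MalformedInputException;
import java.nio.file.Files;
import java.nio.file.Path;

public class ReportDemo {
    public static void main(String[] args) throws IOException {
        Path p = Files.createTempFile("bad", ".txt");
        Files.write(p, new byte[] { 'h', 'i', (byte) 0xFF }); // 0xFF: invalid UTF-8
        try (Reader r = Files.newBufferedReader(p)) {  // UTF-8 + REPORT by default
            char[] buf = new char[16];
            while (r.read(buf) != -1) { /* drain */ }
            System.out.println("no error");
        } catch (MalformedInputException e) {
            System.out.println("caught MalformedInputException"); // this branch runs
        } finally {
            Files.delete(p);
        }
    }
}
```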

Open and read the file as a text file, so the file reader will do the separation into characters for you. If the file has lines, just read it line by line. If it isn't split into lines, then read it in blocks of 1,000 (or whatever) characters. Let the file library deal with the low-level business of converting UTF multi-byte sequences into characters.
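A sketch of the line-by-line variant (class and file names are just for illustration): the Reader hands back complete chars and lines no matter how the underlying multi-byte sequences fall across byte boundaries:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class LineDemo {
    public static void main(String[] args) throws IOException {
        Path p = Files.createTempFile("demo", ".txt");
        // "première" contains a two-byte UTF-8 character (é).
        Files.write(p, "première ligne\nsecond line\n".getBytes(StandardCharsets.UTF_8));
        try (BufferedReader r = Files.newBufferedReader(p)) {
            String line;
            while ((line = r.readLine()) != null) {
                // length() counts chars, not bytes: "première ligne" is 14 chars.
                System.out.println(line.length() + ": " + line);
            }
        }
        Files.delete(p);
    }
}
```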

@fge, I didn't know about the REPORT option - cool. @Tyler, the trick, I think, is using BufferedReader's read() method. Excerpt from here: https://docs.oracle.com/javase/7/docs/api/java/io/BufferedReader.html#read%28char[],%20int,%20int%29

public int read(char[] cbuf,
                int off,
                int len)
         throws IOException

Here is some example output (code below):

read #1, found 32 chars
read #2, found 32 chars
read #3, found 32 chars
read #4, found 32 chars
...
read #80, found 32 chars
read #81, found 32 chars
read #82, found 7 chars
Done, read total=2599 chars, readcnt=82

Note on the output above: it happened to end with a final read of 7 characters; you can adjust the buffer array size to process whatever "chunk" size you want... this is just an example to show you won't have to worry about getting stuck somewhere "mid-byte" in a multi-byte UTF-8 character.

import java.io.*;

class Foo {
   public static void main( String args[] ) throws Exception {
      String encoding = "UTF8";
      String inFilename = "unicode-example-utf8.txt";
      // Test file from http://www.i18nguy.com/unicode/unicode-example-intro.htm
      // Specifically the Example Data, CSV format:
      //     http://www.i18nguy.com/unicode/unicode-example-utf8.zip
      char buff[] = new char[ 32 ]; // or whatever size...
      // I know the readers can be combined by just nesting the temp instances,
      // but for this example I think the structure is easier to follow
      // with each reader explicitly declared.
      FileInputStream finstream = new FileInputStream( inFilename );
      InputStreamReader instream = new InputStreamReader( finstream, encoding );
      BufferedReader in = new BufferedReader( instream );
      int n;
      long total = 0;
      long readcnt = 0;
      while( -1 != (n = in.read( buff, 0, buff.length ) ) ) {
         total += n;
         ++readcnt;
         System.out.println("read #"+readcnt+", found "+n+" chars ");
      }
      System.out.println( "Done, read total="+total+" chars, readcnt="+readcnt );
      in.close();
   }
}

Disclaimer: the technical posts on this site are licensed under CC BY-SA 4.0; if you repost, please credit this site or the original source. For any questions, contact: yoyou2525@163.com.
