
How can I decode a large, multi-byte string file progressively in Java?

I have a program that may need to process large files, possibly in multi-byte encodings. My current code has the problem that it creates an in-memory structure holding the entire file, which can cause an out-of-memory error if the file is large:

Charset charset = Charset.forName( "UTF-8" );
CharsetDecoder decoder = charset.newDecoder();
FileInputStream fis = new FileInputStream( file );
FileChannel fc = fis.getChannel();
int lenFile = (int)fc.size();
MappedByteBuffer bufferFile = fc.map( FileChannel.MapMode.READ_ONLY, 0, lenFile );
CharBuffer cb = decoder.decode( bufferFile );
// process character buffer
fc.close();

The problem is that if I split the file's bytes into smaller buffers and feed them piecemeal to the decoder, a buffer could end in the middle of a multi-byte sequence. How should I cope with this problem?

It is as easy as using a Reader.

A CharsetDecoder is indeed the underlying mechanism which allows the decoding of bytes into chars. In short, you could say that:

// Extrapolation...
byte stream --> decoding       --> char stream
InputStream --> CharsetDecoder --> Reader
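To make the diagram concrete, here is a minimal sketch of roughly what a Reader has to do internally: call the decoder repeatedly on a small byte buffer and carry any incomplete trailing sequence over to the next read with compact(). The class name ChunkedDecodeSketch and the method decodeInChunks are made up for illustration, and the JDK's own readers are implemented differently in detail.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of the JDK: decodes UTF-8 through a small buffer.
class ChunkedDecodeSketch {
    static String decodeInChunks(InputStream in, int bufSize) {
        if (bufSize < 4) {
            // A UTF-8 sequence is at most 4 bytes; the buffer must hold a whole one.
            throw new IllegalArgumentException("bufSize must be at least 4");
        }
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder();
        ByteBuffer bytes = ByteBuffer.allocate(bufSize);
        CharBuffer chars = CharBuffer.allocate(bufSize);
        StringBuilder out = new StringBuilder();
        try {
            boolean eof = false;
            while (!eof) {
                // Append new bytes after whatever was left over from the last round.
                int n = in.read(bytes.array(), bytes.position(), bytes.remaining());
                if (n < 0) {
                    eof = true;
                } else {
                    bytes.position(bytes.position() + n);
                }
                bytes.flip();
                CoderResult result;
                do {
                    result = decoder.decode(bytes, chars, eof);
                    if (result.isError()) {
                        result.throwException(); // malformed or unmappable input
                    }
                    drain(chars, out);
                } while (result.isOverflow());
                // Keep the bytes of an incomplete sequence for the next read.
                bytes.compact();
            }
            while (decoder.flush(chars).isOverflow()) {
                drain(chars, out);
            }
            drain(chars, out);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }

    // Moves the decoded characters into the output and empties the buffer.
    private static void drain(CharBuffer chars, StringBuilder out) {
        chars.flip();
        out.append(chars);
        chars.clear();
    }

    public static void main(String[] args) {
        String text = "h\u00e9llo, \u4e16\u754c \ud834\udd1e";
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        String decoded = decodeInChunks(new ByteArrayInputStream(utf8), 4);
        System.out.println(decoded.equals(text));
    }
}
```

In practice there is rarely a reason to write this loop by hand; wrapping the stream in an InputStreamReader gives the same carry-over behaviour.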

A lesser-known fact is that most (but not all; see below) default decoders in the JDK (such as those created by a FileReader, for instance, or by an InputStreamReader given only a charset) have a policy of CodingErrorAction.REPLACE. The effect is to replace any invalid byte sequence in the input with the Unicode replacement character U+FFFD (yes, that infamous �).
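A small sketch of that default, using an in-memory stream rather than a file: the byte 0xFF can never appear in valid UTF-8, and an InputStreamReader built from a charset alone silently turns it into U+FFFD. The class and method names here are illustrative only.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class showing the REPLACE default of InputStreamReader.
class ReplaceDefaultSketch {
    static String readAll(byte[] data) {
        StringBuilder out = new StringBuilder();
        try (Reader reader = new InputStreamReader(
                new ByteArrayInputStream(data), StandardCharsets.UTF_8)) {
            int c;
            while ((c = reader.read()) != -1) {
                out.append((char) c);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        byte[] data = { 'a', (byte) 0xFF, 'b' }; // 0xFF is invalid in UTF-8
        String text = readAll(data);
        // No exception is thrown; the bad byte became the replacement character.
        System.out.println(text.equals("a\uFFFDb"));
    }
}
```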

Now, if you are concerned about "bad characters" slipping in, you can instead select a policy of REPORT. You can do that when reading a file, too, as follows; this throws a MalformedInputException on any malformed byte sequence:

// This is 2015. File is obsolete.
final Path path = Paths.get(...);
final CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
    .onMalformedInput(CodingErrorAction.REPORT);

try (
    final InputStream in = Files.newInputStream(path);
    final Reader reader = new InputStreamReader(in, decoder);
) {
    // use the reader
}
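The same idea can be tried without a file. The sketch below feeds a deliberately broken byte array to a reader configured with REPORT and checks that reading fails; the class name ReportPolicySketch and the method isRejected are invented for this example.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.MalformedInputException;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class: a strict reader that refuses malformed UTF-8.
class ReportPolicySketch {
    // Returns true if decoding the data raised a MalformedInputException.
    static boolean isRejected(byte[] data) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
            .onMalformedInput(CodingErrorAction.REPORT)
            .onUnmappableCharacter(CodingErrorAction.REPORT);
        try (Reader reader = new InputStreamReader(
                new ByteArrayInputStream(data), decoder)) {
            char[] buf = new char[64];
            while (reader.read(buf) != -1) {
                // Characters are discarded; only the decoding outcome matters here.
            }
            return false;
        } catch (MalformedInputException e) {
            return true;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(isRejected(new byte[] { 'a', (byte) 0xFF, 'b' }));
        System.out.println(isRejected("fine".getBytes(StandardCharsets.UTF_8)));
    }
}
```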

One exception to the default REPLACE action appears in Java 8: Files.newBufferedReader(somePath) always reads as UTF-8, and with a default action of REPORT.
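A quick way to observe this is to write a few bytes that are not valid UTF-8 to a temporary file and read it back. This is only a sketch: the class name, the temp file prefix and the byte values are arbitrary choices for the demonstration.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.MalformedInputException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical demo class: Files.newBufferedReader(Path) reports bad input.
class StrictFilesReaderSketch {
    // Returns true if reading a file with an invalid byte threw MalformedInputException.
    static boolean failsOnMalformed() {
        try {
            Path tmp = Files.createTempFile("malformed", ".txt");
            tmp.toFile().deleteOnExit();
            Files.write(tmp, new byte[] { 'o', 'k', (byte) 0xFF, '\n' });
            try (BufferedReader reader = Files.newBufferedReader(tmp)) {
                while (reader.readLine() != null) {
                    // Lines are discarded; only the decoding outcome matters here.
                }
                return false;
            } catch (MalformedInputException e) {
                return true;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(failsOnMalformed());
    }
}
```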

Open and read the file as a text file, so the file reader does the separation into characters for you. If the file has lines, just read it line by line. If it isn't split into lines, read it in blocks of 1,000 (or however many) characters. Let the file library deal with the low-level work of converting UTF-8 multi-byte sequences into characters.
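Both approaches might look like the sketch below, which reads from any InputStream so it can be tried on an in-memory byte array as well as a FileInputStream. The class and method names are invented for the example, and the "processing" is just counting.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class: line-by-line and block-by-block text reading.
class LineOrBlockSketch {
    // Reads line by line; only one line is held in memory at a time.
    static int countLines(InputStream in) {
        int lines = 0;
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(in, StandardCharsets.UTF_8))) {
            while (reader.readLine() != null) {
                lines++; // process the line here
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return lines;
    }

    // Reads in blocks of at most blockSize chars, for input without line breaks.
    static long countCharsInBlocks(InputStream in, int blockSize) {
        long total = 0;
        char[] block = new char[blockSize];
        try (Reader reader = new BufferedReader(
                new InputStreamReader(in, StandardCharsets.UTF_8))) {
            int n;
            while ((n = reader.read(block, 0, block.length)) != -1) {
                total += n; // process block[0..n) here
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return total;
    }

    public static void main(String[] args) {
        byte[] lines = "a\nb\u00e9\n\u4e16".getBytes(StandardCharsets.UTF_8);
        byte[] blob = "h\u00e9llo \u4e16\u754c".getBytes(StandardCharsets.UTF_8);
        System.out.println(countLines(new ByteArrayInputStream(lines)));
        System.out.println(countCharsInBlocks(new ByteArrayInputStream(blob), 3));
    }
}
```

One caveat with block reads: a Java char is a UTF-16 code unit, so a block can still end between the two halves of a surrogate pair even though it never ends in the middle of a byte sequence.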

@fge, I didn't know about the REPORT option; cool. @Tyler, the trick, I think, is to use BufferedReader's read(char[], int, int) method. Excerpt from https://docs.oracle.com/javase/7/docs/api/java/io/BufferedReader.html#read%28char[],%20int,%20int%29

public int read(char[] cbuf,
       int off,
       int len)
         throws IOException

Here is some example output (code below):

read #1, found 32 chars
read #2, found 32 chars
read #3, found 32 chars
read #4, found 32 chars
...
read #80, found 32 chars
read #81, found 32 chars
read #82, found 7 chars
Done, read total=2599 chars, readcnt=82

Note that the output above happens to end with a final read of 7 characters. You can adjust the buffer array size to process whatever "chunk" size you want; this is just an example to show that you won't have to worry about getting stuck "mid-byte" in a multi-byte UTF-8 character.

import java.io.*;

class Foo {
   public static void main( String args[] ) throws Exception {
      String encoding = "UTF8";
      String inFilename = "unicode-example-utf8.txt";
      // Test file from http://www.i18nguy.com/unicode/unicode-example-intro.htm
      // Specifically the Example Data, CSV format:
      //     http://www.i18nguy.com/unicode/unicode-example-utf8.zip
      char buff[] = new char[ 32 ]; // or whatever size...
      // The readers could be nested into a single expression, but for an
      // example I think the structure is easier to follow with each reader
      // declared explicitly. try-with-resources closes them even if a read fails.
      try ( FileInputStream finstream = new FileInputStream( inFilename );
            InputStreamReader instream = new InputStreamReader( finstream, encoding );
            BufferedReader in = new BufferedReader( instream ) ) {
         int n;
         long total = 0;
         long readcnt = 0;
         // read() returns the number of chars placed in buff, or -1 at end of stream
         while( -1 != (n = in.read( buff, 0, buff.length ) ) ) {
            total += n;
            ++readcnt;
            System.out.println("read #"+readcnt+", found "+n+" chars ");
         }
         System.out.println( "Done, read total="+total+" chars, readcnt="+readcnt );
      }
   }
}
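To check the "mid-byte" claim without the test file, the sketch below wraps an in-memory UTF-8 byte array in a stream that hands out at most one byte per read call, so every multi-byte character arrives split across reads. The OneByteStream wrapper and the other names are hypothetical, written only for this demonstration.

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class: the reader reassembles characters split across reads.
class TrickleStreamSketch {
    // Delivers at most one byte per bulk read, forcing split multi-byte sequences.
    static class OneByteStream extends FilterInputStream {
        OneByteStream(InputStream in) {
            super(in);
        }

        @Override
        public int read(byte[] b, int off, int len) throws IOException {
            return super.read(b, off, Math.min(len, 1));
        }
    }

    // Encodes text as UTF-8, reads it back in chunkChars-sized char blocks.
    static String readThroughTrickle(String text, int chunkChars) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        StringBuilder out = new StringBuilder();
        char[] buff = new char[chunkChars];
        try (Reader in = new InputStreamReader(
                new OneByteStream(new ByteArrayInputStream(utf8)),
                StandardCharsets.UTF_8)) {
            int n;
            while ((n = in.read(buff, 0, buff.length)) != -1) {
                out.append(buff, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String text = "h\u00e9llo, \u4e16\u754c";
        System.out.println(readThroughTrickle(text, 5).equals(text));
    }
}
```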

