
Character encoding: converting a windows-1252 input file to a UTF-8 output file

I am processing an HTML doc that I produced programmatically from Word's save-as-HTML option. This HTML text file is windows-1252 encoded. (Yes, I've read quite a bit about bytes and Unicode code points; I know code points beyond 128 can take 2, 3, or 4 bytes in UTF-8, etc.) I added quite a few unprintable chars to my Word doc Template and wrote code to evaluate each CHARACTER (its decimal equivalent). FOR SURE I know I don't want to allow decimal #160, which is MS Word's translation into HTML of the non-breaking space. I'm anticipating that in the near future people will put more of these "illegal" constructs into the Templates, and I'll need to trap them and deal with them, because they cause funny viewing in the browser, to wit (this is a dump to the Eclipse console; I put all the doc lines into a map):

 DataObj.paragraphMap  : {1=, 2=Introduction and Learning Objective, 3=? ©®™§¶…‘’“”????, 4=, 5=, 6=, 
   7=This is paragraph 1 no formula, 8=, 

I replaced decimal #160 with #32 (a regular space) and then wrote the characters to a new file using UTF-8 encoding. So is my thinking sound: can I use this technique to either replace a particular character, or decide not to write it back, based on its decimal equivalent? I wanted to avoid String because I can take on multiple docs and don't want to run out of memory, so I'm doing it with files...

 public static void convert1252toUTF8(String fileName) throws IOException {   
    File f = new File(fileName);
    Reader r = new BufferedReader(new InputStreamReader(new FileInputStream(f), "windows-1252"));
    OutputStreamWriter writer = new OutputStreamWriter(new FileOutputStream(fileName + "x"), StandardCharsets.UTF_8); 
    List<Character> charsList = new ArrayList<>(); 
    int count = 0;

    try {
        int intch;
        while ((intch = r.read()) != -1) {   //reads a single character and returns integer equivalent
            int ch = (char)intch;
            //System.out.println("intch=" + intch + " ch=" + ch + " isValidCodePoint()=" + Character.isValidCodePoint(ch) 
            //+ " isDefined()=" + Character.isDefined(ch) + " charCount()=" + Character.charCount(ch) + " char=" 
            //+ (char)intch);

            if (Character.isValidCodePoint(ch)) {
                if (intch == 160 ) {
                    intch = 32;
                }
                charsList.add((char)intch);
                count++;
            } else {
                System.out.println("unexpected character found but not dealt with.");
            }
        }
    } catch (Exception e) {
        e.printStackTrace();
    } finally {
        System.out.println("Chars read in=" + count + " Chars read out=" + charsList.size());
        for(Character item : charsList) {
            writer.write((char)item);
        }
        writer.close();
        r.close();
        charsList = null;

        //check that #160 was replaced
        //File f2 = new File(fileName + "x"); 
        //Reader r2 = new BufferedReader(new InputStreamReader(new FileInputStream(f2), "UTF-8")); 
        //int intch2;
        //while ((intch2 = r2.read()) != -1) { //reads a single character and returns integer equivalent 
        //int ch2 = (char)intch2; 
        //System.out.println("intch2=" + intch2 + " ch2=" + ch2 + " isValidCodePoint()=" +
        //Character.isValidCodePoint(ch2) + " char=" + (char)intch2); 
        //}

    }   
}

First, there is nothing wrong with an HTML page being in an encoding other than UTF-8. In fact, it's very likely that the document contains a line like

<meta http-equiv="Content-Type" content="text/html; charset=windows-1252">

in its header, which renders the document invalid when you change the file's character encoding without adapting this header line.
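
So if you do convert the file, the declaration has to be adapted as well. A minimal sketch of such a post-processing step, assuming the header line appears literally as shown above (the class and method names are made up for illustration):

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class CharsetHeaderFix {
    // Hypothetical post-processing step: after converting the bytes to
    // UTF-8, make the declaration match them. Reading the whole file into
    // one String is acceptable here, as it's a one-off fix on the already
    // converted file, not the conversion itself.
    static void fixCharsetDeclaration(String fileName) throws IOException {
        Path p = Paths.get(fileName);
        String html = new String(Files.readAllBytes(p), StandardCharsets.UTF_8);
        html = html.replace(
            "content=\"text/html; charset=windows-1252\"",
            "content=\"text/html; charset=utf-8\"");
        Files.write(p, html.getBytes(StandardCharsets.UTF_8));
    }
}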

Further, there is no reason to replace codepoint #160 in the document, as it is Unicode's standard non-breaking space character (U+00A0), which is why &#160; is a valid alternative to &nbsp;. And if the document's charset supports this codepoint, using it directly is valid too.

Your attempt to avoid strings is a typical case of premature optimization. The lack of actual measurement leads to a solution like ArrayList<Character>, which consumes twice¹ the memory of a String.
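
If in doubt, measure. A crude sketch (not a rigorous benchmark; exact numbers depend on the JVM and its settings) that compares heap usage before and after filling such a list:

import java.util.ArrayList;
import java.util.List;

public class BoxingOverhead {
    public static void main(String[] args) {
        int n = 1_000_000;
        Runtime rt = Runtime.getRuntime();

        long before = used(rt);
        List<Character> list = new ArrayList<>(n);
        for (int i = 0; i < n; i++) {
            // values above 127 bypass the Character cache, so each add may
            // allocate a fresh box, just like Word's typographic characters
            list.add((char) (160 + (i % 64)));
        }
        long after = used(rt);
        System.out.printf("%d boxed chars: ~%.1f bytes/char%n",
                list.size(), (after - before) / (double) n);
        // a String of the same length needs roughly 2 bytes per char
    }

    static long used(Runtime rt) {
        System.gc(); // only a hint; the measurement is approximate
        return rt.totalMemory() - rt.freeMemory();
    }
}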

If you want to copy or convert a file, you shouldn't keep the entire file in memory. Just write the data back before reading the next chunk, and, for the sake of efficiency, use a buffer rather than reading and writing a single character at a time. Further, you should use the try-with-resources statement for managing the input and output resources.

public static void convert1252toUTF8(String fileName) throws IOException {
    Path in = Paths.get(fileName), out = Paths.get(fileName+"x");
    int readCount = 0, writeCount = 0;
    try(BufferedReader br = Files.newBufferedReader(in, Charset.forName("windows-1252"));
        BufferedWriter bw = Files.newBufferedWriter(out, // default UTF-8
            StandardOpenOption.CREATE, StandardOpenOption.TRUNCATE_EXISTING)) {

        char[] buffer = new char[1000];
        do {
            int count = br.read(buffer);
            if(count < 0) break;
            readCount += count;

            // if you really want to replace non breaking spaces:
            for(int ix = 0; ix < count; ix++) {
                if(buffer[ix] == 160) buffer[ix] = ' ';
            }

            bw.write(buffer, 0, count);
            writeCount += count;
        } while(true);
    } finally {
        System.out.println("Chars read in="+readCount+" Chars written out="+writeCount);
    }
}
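
Usage stays the same as with your method, e.g. (the file name is made up):

convert1252toUTF8("lesson1.html"); // reads lesson1.html, writes lesson1.htmlx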

There is no point in testing the validity of the characters, as the decoder doesn't produce invalid codepoints. The decoder is by default configured to throw an exception on invalid bytes. The other options are to replace invalid input with the replacement character (�, U+FFFD) or to skip it, but it will never produce invalid characters.
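
For completeness, these error actions can be chosen explicitly when constructing the decoder yourself; a small sketch (the helper method is made up for illustration):

static CharsetDecoder decoderFor(CodingErrorAction action) {
    return Charset.forName("windows-1252").newDecoder()
            .onMalformedInput(action)       // byte sequences the charset cannot decode
            .onUnmappableCharacter(action); // bytes without a Unicode mapping
}
// decoderFor(CodingErrorAction.REPORT)  throws an exception (the default here)
// decoderFor(CodingErrorAction.REPLACE) substitutes U+FFFD
// decoderFor(CodingErrorAction.IGNORE)  silently drops the offending input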

The amount of memory needed during the operation is determined by the buffer size, though the code above uses a reader and writer which have buffers of their own. Still, the total amount of memory used for the operation is independent of the file size.

A solution only using your explicitly specified buffer would look like

public static void convert1252toUTF8(String fileName) throws IOException {
    Path in = Paths.get(fileName), out = Paths.get(fileName+"x");
    int readCount = 0, writeCount = 0;
    try(Reader br = Channels.newReader(Files.newByteChannel(in), "windows-1252");
        Writer bw = Channels.newWriter(
            Files.newByteChannel(out, WRITE, CREATE, TRUNCATE_EXISTING), // static imports of StandardOpenOption
            StandardCharsets.UTF_8)) {

        char[] buffer = new char[1000];
        do {
            int count = br.read(buffer);
            if(count < 0) break;
            readCount += count;

            // if you really want to replace non breaking spaces:
            for(int ix = 0; ix < count; ix++) {
                if(buffer[ix] == 160) buffer[ix] = ' ';
            }

            bw.write(buffer, 0, count);
            writeCount += count;
        } while(true);
    } finally {
        System.out.println("Chars read in="+readCount+" Chars written out="+writeCount);
    }
}

This would also be the starting point for implementing different handling of invalid input; e.g., to just remove all invalid input bytes, you only have to change the beginning of the method to

public static void convert1252toUTF8(String fileName) throws IOException {
    Path in = Paths.get(fileName), out = Paths.get(fileName+"x");
    int readCount = 0, writeCount = 0;
    CharsetDecoder dec = Charset.forName("windows-1252")
            .newDecoder().onUnmappableCharacter(CodingErrorAction.IGNORE);
    try(Reader br = Channels.newReader(Files.newByteChannel(in), dec, -1);
…

Note that for a successful conversion, the number of characters read and written is the same. Only for the input file is the number of characters also identical to the number of bytes, i.e. the file size, since windows-1252 is a single-byte encoding (and only when the entire file is valid).
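
So a quick plausibility check could compare the counters against the file sizes; a sketch that could be appended at the end of the method above, reusing its in and out paths:

// sketch: place after the try statement, next to the existing counters
long inBytes  = Files.size(in);   // equals readCount for a valid windows-1252 input
long outBytes = Files.size(out);  // >= writeCount: chars above 127 take 2 or 3 bytes in UTF-8
System.out.println("input: " + inBytes + " bytes, output: " + outBytes + " bytes");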

This conversion code example was only included for completeness; as said at the beginning, converting an HTML page without adapting its charset header may invalidate the file, and it isn't even necessary.

¹ depending on the implementation, even four times
