
Character encoding: converting a windows-1252 input file to a UTF-8 output file

I am processing an HTML document that I produced (programmatically) from Word's save-as-HTML option. This HTML text file is windows-1252 encoded. (Yes, I've read quite a bit about bytes and Unicode code points; I know code points beyond 127 can take 2, 3, or 4 bytes in UTF-8, etc.) I added quite a few unprintable characters to my Word document template and wrote code to evaluate each character (by its decimal equivalent). For sure I know I don't want to allow decimal #160, which is MS Word's translation of the non-breaking space into HTML. I'm anticipating that in the near future people will put more of these "illegal" constructs into the templates, and I'll need to trap them and deal with them, because they cause odd rendering in the browser, to wit (this is a dump to the Eclipse console; I put all the doc lines into a map):

 DataObj.paragraphMap  : {1=, 2=Introduction and Learning Objective, 3=? ©®™§¶…‘’“”????, 4=, 5=, 6=, 
   7=This is paragraph 1 no formula, 8=, 

I replaced decimal #160 with #32 (a regular space) and then wrote the characters to a new file using UTF-8 encoding. So: is my thinking sound? Can I use this technique to either replace a particular character, or decide not to write it back, based on its decimal equivalent? I wanted to avoid String because I may take on multiple docs and don't want to run out of memory, so I'm doing it with files...

 public static void convert1252toUTF8(String fileName) throws IOException {
    File f = new File(fileName);
    Reader r = new BufferedReader(new InputStreamReader(new FileInputStream(f), "windows-1252"));
    OutputStreamWriter writer = new OutputStreamWriter(new FileOutputStream(fileName + "x"), StandardCharsets.UTF_8); 
    List<Character> charsList = new ArrayList<>(); 
    int count = 0;

    try {
        int intch;
        while ((intch = r.read()) != -1) {   //reads a single character and returns integer equivalent
            int ch = (char)intch;
            //System.out.println("intch=" + intch + " ch=" + ch + " isValidCodePoint()=" + Character.isValidCodePoint(ch) 
            //+ " isDefined()=" + Character.isDefined(ch) + " charCount()=" + Character.charCount(ch) + " char=" 
            //+ (char)intch);

            if (Character.isValidCodePoint(ch)) {
                if (intch == 160 ) {
                    intch = 32;
                }
                charsList.add((char)intch);
                count++;
            } else {
                System.out.println("unexpected character found but not dealt with.");
            }
        }
    } catch (Exception e) {
        e.printStackTrace();
    } finally {
        System.out.println("Chars read in=" + count + " Chars read out=" + charsList.size());
        for(Character item : charsList) {
            writer.write((char)item);
        }
        writer.close();
        r.close();
        charsList = null;

        //check that #160 was replaced
        //File f2 = new File(fileName + "x");
        //Reader r2 = new BufferedReader(new InputStreamReader(new FileInputStream(f2), "UTF-8")); 
        //int intch2;
        //while ((intch2 = r2.read()) != -1) { //reads a single character and returns integer equivalent 
        //int ch2 = (char)intch2; 
        //System.out.println("intch2=" + intch2 + " ch2=" + ch2 + " isValidCodePoint()=" +
        //Character.isValidCodePoint(ch2) + " char=" + (char)intch2); 
        //}

    }   
}

First, there is nothing wrong with an HTML page being in a different encoding than UTF-8. In fact, it's very likely that the document contains a line like

<meta http-equiv="Content-Type" content="text/html; charset=windows-1252">

in its header, which renders the document invalid when you change the file's character encoding without adapting this header line.
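If you do convert the file anyway, the simplest fix is to rewrite that header line as part of the copy. A minimal sketch (the helper name and the purely textual replacement are my own, and assume the attribute is spelled exactly as in the line above):

```java
// Hypothetical helper, not part of the original code: while copying lines,
// rewrite the meta charset declaration so the header matches the new encoding.
public class MetaFix {
    public static String fixCharsetMeta(String line) {
        // naive textual replacement; assumes the attribute is written exactly like this
        return line.replace("charset=windows-1252", "charset=utf-8");
    }

    public static void main(String[] args) {
        System.out.println(fixCharsetMeta(
            "<meta http-equiv=\"Content-Type\" content=\"text/html; charset=windows-1252\">"));
    }
}
```

A real-world version would want a case-insensitive match or an HTML parser, but for Word-generated output the declaration is usually written in exactly this form.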

Further, there is no reason to replace code point #160 in the document, as it is Unicode's standard non-breaking space character, which is the reason why &#160; is a valid alternative to &nbsp; and, if the document's charset supports this code point, using the character directly is valid too.
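You can confirm that the numeric entity and the Unicode character denote the same thing; this tiny sketch (class name is mine) just prints the code point of U+00A0, the NO-BREAK SPACE:

```java
// U+00A0 (NO-BREAK SPACE) is exactly the character that &#160; and &nbsp; denote.
public class NbspDemo {
    public static int nbspCodePoint() {
        return '\u00A0'; // the char widens to int: 160
    }

    public static void main(String[] args) {
        System.out.println(nbspCodePoint()); // prints 160
    }
}
```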

Your attempt to avoid strings is a typical case of premature optimization. The lack of actual measurement leads to a solution like ArrayList<Character>, which consumes twice¹ the memory of a String.

If you want to copy or convert a file, you shouldn't keep the entire file in memory. Just write the data back before reading the next chunk; but for the sake of efficiency, use a buffer rather than reading and writing a single character at a time. Further, you should use the try-with-resources statement for managing the input and output resources.

public static void convert1252toUTF8(String fileName) throws IOException {
    Path in = Paths.get(fileName), out = Paths.get(fileName+"x");
    int readCount = 0, writeCount = 0;
    try(BufferedReader br = Files.newBufferedReader(in, Charset.forName("windows-1252"));
        BufferedWriter bw = Files.newBufferedWriter(out, // default UTF-8
            StandardOpenOption.CREATE, StandardOpenOption.TRUNCATE_EXISTING)) {

        char[] buffer = new char[1000];
        do {
            int count = br.read(buffer);
            if(count < 0) break;
            readCount += count;

            // if you really want to replace non breaking spaces:
            for(int ix = 0; ix < count; ix++) {
                if(buffer[ix] == 160) buffer[ix] = ' ';
            }

            bw.write(buffer, 0, count);
            writeCount += count;
        } while(true);
    } finally {
        System.out.println("Chars read in="+readCount+" Chars written out="+writeCount);
    }
}

There is no point in testing the validity of the characters, as the decoder doesn't produce invalid code points. The decoder is by default configured to throw an exception on invalid bytes. The other options are to replace invalid input with replacement characters (like �) or to skip it, but it will never produce invalid characters.
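The decoder's error actions can be seen in isolation with a CharsetDecoder. This sketch is my own, and it uses UTF-8 rather than windows-1252 for the demonstration, because UTF-8 has unambiguously invalid byte sequences (0xFF can never occur in valid UTF-8); it contrasts the default REPORT action with REPLACE:

```java
import java.nio.ByteBuffer;
import java.nio.charset.*;

public class DecoderActions {
    // decode the given bytes as UTF-8, applying the same error action
    // to both malformed input and unmappable characters
    public static String decode(byte[] bytes, CodingErrorAction action)
            throws CharacterCodingException {
        CharsetDecoder dec = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(action)
                .onUnmappableCharacter(action);
        return dec.decode(ByteBuffer.wrap(bytes)).toString();
    }

    public static void main(String[] args) throws CharacterCodingException {
        byte[] bad = { 'A', (byte) 0xFF, 'B' }; // 0xFF is invalid in UTF-8
        System.out.println(decode(bad, CodingErrorAction.REPLACE)); // A�B
        try {
            decode(bad, CodingErrorAction.REPORT); // the default: throws
        } catch (MalformedInputException e) {
            System.out.println("REPORT throws " + e.getClass().getSimpleName());
        }
    }
}
```

Note that the convenience methods (`new String(bytes, charset)`, `Files.newBufferedReader`) pick different defaults: the former replaces, the latter reports.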

The amount of memory needed during the operation is determined by the buffer size, though the code above uses a reader and writer that have buffers of their own. Still, the total amount of memory used for the operation is independent of the file size.

A solution using only your explicitly specified buffer would look like

public static void convert1252toUTF8(String fileName) throws IOException {
    Path in = Paths.get(fileName), out = Paths.get(fileName+"x");
    int readCount = 0, writeCount = 0;
    try(Reader br = Channels.newReader(Files.newByteChannel(in), "windows-1252");
        Writer bw = Channels.newWriter(
            Files.newByteChannel(out, WRITE, CREATE, TRUNCATE_EXISTING),
            StandardCharsets.UTF_8)) {

        char[] buffer = new char[1000];
        do {
            int count = br.read(buffer);
            if(count < 0) break;
            readCount += count;

            // if you really want to replace non breaking spaces:
            for(int ix = 0; ix < count; ix++) {
                if(buffer[ix] == 160) buffer[ix] = ' ';
            }

            bw.write(buffer, 0, count);
            writeCount += count;
        } while(true);
    } finally {
        System.out.println("Chars read in="+readCount+" Chars written out="+writeCount);
    }
}

This would also be the starting point for implementing different handling of invalid input; e.g., to just remove all invalid input bytes, you only have to change the beginning of the method to

public static void convert1252toUTF8(String fileName) throws IOException {
    Path in = Paths.get(fileName), out = Paths.get(fileName+"x");
    int readCount = 0, writeCount = 0;
    CharsetDecoder dec = Charset.forName("windows-1252")
            .newDecoder().onUnmappableCharacter(CodingErrorAction.IGNORE);
    try(Reader br = Channels.newReader(Files.newByteChannel(in), dec, -1);
…
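The effect of CodingErrorAction.IGNORE can also be seen in isolation. Again this sketch is mine and uses UTF-8 for the demonstration, because it has unambiguously invalid bytes; with IGNORE, the invalid input is silently dropped:

```java
import java.nio.ByteBuffer;
import java.nio.charset.*;

public class IgnoreDemo {
    // decode as UTF-8, silently dropping any invalid input
    public static String decodeIgnoring(byte[] bytes) throws CharacterCodingException {
        CharsetDecoder dec = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.IGNORE)
                .onUnmappableCharacter(CodingErrorAction.IGNORE);
        return dec.decode(ByteBuffer.wrap(bytes)).toString();
    }

    public static void main(String[] args) throws CharacterCodingException {
        // the invalid 0xFF byte simply disappears from the output
        System.out.println(decodeIgnoring(new byte[] { 'A', (byte) 0xFF, 'B' })); // AB
    }
}
```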

Note that for a successful conversion, the number of characters read and written is the same; but only for the input encoding windows-1252 is the number of characters identical to the number of bytes, i.e. the file size (when the entire file is valid).
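That 1:1 relation for a single-byte charset can be checked directly. A sketch of my own (it reads the whole file, so it is meant for small test files only, not as a conversion technique):

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.file.*;

public class CountCheck {
    // for a file that is valid windows-1252, decoding yields one char per byte,
    // so the char count equals Files.size(p)
    public static boolean charCountEqualsByteCount(Path p) throws IOException {
        long bytes = Files.size(p);
        long chars = new String(Files.readAllBytes(p),
                                Charset.forName("windows-1252")).length();
        return bytes == chars;
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("cp1252", ".txt");
        Files.write(tmp, new byte[] { 'H', 'i', (byte) 0xE9 }); // 0xE9 = 'é' in windows-1252
        System.out.println(charCountEqualsByteCount(tmp)); // true
        Files.delete(tmp);
    }
}
```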

This conversion code example was only provided for completeness; as said at the beginning, converting an HTML page without adapting the header might invalidate the file, and the conversion isn't even necessary.

¹ depending on the implementation, even four times
