How can I convert UTF-8 literals into their UTF-8 characters?

I have a bunch of text files that were encoded in UTF-8. The text inside the files looks like this: \\x6c\\x69b/\\x62\\x2f\\x6d\\x69nd/m\\x61x\\x2e\\x70h\\x70

I've copied all these text files and placed them into a directory /convert/.

I need to read each file, convert the encoded literals into characters, and then save the result as filename.converted.txt.

What would be the smartest approach to do this? What can I do to convert to the new text? Is there a function for handling Unicode text that converts from the literal form to the actual characters? Should I be using a different programming language for this?

This is what I have at the moment:

import java.io.BufferedWriter;
import java.io.File;
import java.io.FileWriter;

public class decode {
    public static void main(String args[]) {
        File directory = new File("C:/convert/");
        String[] files = directory.list();
        boolean success = false;
        for (String file : files) {
            System.out.println("Processing \"" + file + "\"");

            //TODO read each file and convert them into characters
            success = true;

            if (success) {
                System.out.println("Successfully converted \"" + file + "\"");
            } else {
                System.out.println("Failed to convert \"" + file + "\"");
            }

            //save file
            if (success) {
                try {
                    FileWriter open = new FileWriter("C:/convert/" + file + ".converted.txt");
                    BufferedWriter write = new BufferedWriter(open);
                    write.write("TODO: write converted text into file");
                    write.close();
                    System.out.println("Successfully saved \"" + file + "\" conversion.");
                } catch (Exception e) {
                    e.printStackTrace();
                }
            }
        }
    }
}

(It looks like there's some confusion about what you mean. This answer assumes the input file is entirely in ASCII, and uses "\\x" to hex-encode any bytes which aren't in the ASCII range.)

It sounds to me like the UTF-8 part of it is actually irrelevant. You can treat it as opaque binary data for output. Assuming the input file is entirely ASCII:

  • Open the input file as text (e.g. using FileInputStream wrapped in InputStreamReader, specifying an encoding of "US-ASCII")
  • Open the output file as binary (e.g. using FileOutputStream)
  • Read each character from the input
    • Is it '\\'?
      • If not, write the character's ASCII value to the output stream (just cast from char to byte)
      • If so, what's the next character?
        • If it's 'x', read the next two characters, convert them from hex to a byte (there's lots of code around to do this part), and write that byte to the output stream
        • If it's '\\', write the ASCII value for '\\' to the output stream
        • Otherwise, possibly throw an exception indicating failure
  • Loop until you've exhausted the input file
  • Close both files in finally blocks

You'll then have a "normal" UTF-8 file which should be readable by any text editor which supports UTF-8.
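The steps above can be sketched roughly as follows. This is only a minimal illustration, not a definitive implementation: the class name HexEscapeDecoder and the methods decode and hexValue are hypothetical names chosen for this example, the directory defaults to the C:/convert/ path from the question, and try-with-resources is used in place of explicit finally blocks. It also assumes each escape in the files is a single backslash followed by x and two hex digits.

```java
import java.io.BufferedOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class HexEscapeDecoder {

    /**
     * Copies ASCII text from in to out, turning each \xNN escape into the
     * raw byte NN and each doubled backslash into a single backslash.
     */
    public static void decode(Reader in, OutputStream out) throws IOException {
        int c;
        while ((c = in.read()) != -1) {
            if (c != '\\') {
                // Defensive check: the input is assumed to be pure ASCII.
                if (c > 0x7F) {
                    throw new IOException("Non-ASCII character in input: " + c);
                }
                out.write(c);
                continue;
            }
            int next = in.read();
            if (next == 'x') {
                int hi = hexValue(in.read());
                int lo = hexValue(in.read());
                if (hi < 0 || lo < 0) {
                    throw new IOException("Malformed or truncated \\x escape");
                }
                out.write((hi << 4) | lo);
            } else if (next == '\\') {
                out.write('\\');
            } else {
                throw new IOException("Unsupported or truncated escape after backslash");
            }
        }
    }

    /** Returns the value of a hex digit, or -1 for end of input or a non-hex character. */
    private static int hexValue(int ch) {
        return ch < 0 ? -1 : Character.digit((char) ch, 16);
    }

    public static void main(String[] args) throws IOException {
        // Directory to process; falls back to the path used in the question.
        Path dir = Paths.get(args.length > 0 ? args[0] : "C:/convert/");
        try (DirectoryStream<Path> files = Files.newDirectoryStream(dir)) {
            for (Path file : files) {
                // Skip subdirectories and the outputs of earlier runs.
                if (!Files.isRegularFile(file) || file.toString().endsWith(".converted.txt")) {
                    continue;
                }
                Path target = Paths.get(file + ".converted.txt");
                try (Reader in = Files.newBufferedReader(file, StandardCharsets.US_ASCII);
                     OutputStream out = new BufferedOutputStream(Files.newOutputStream(target))) {
                    decode(in, out);
                }
                System.out.println("Converted \"" + file + "\" to \"" + target + "\"");
            }
        }
    }
}
```

Because the output is written as raw bytes, multi-byte UTF-8 sequences that were escaped byte by byte (for example \xc3\xa9) come out as the original UTF-8 encoding without the program ever decoding them.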

java.io.InputStreamReader can be used to convert an input stream from an arbitrary charset into Java chars. I'm not exactly sure how you want to write it back out, though. Do you want non-ASCII characters to be written out as ASCII Unicode escape sequences?
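As a small illustration of the InputStreamReader approach mentioned here, the sketch below decodes a byte stream into Java chars using an explicit charset. The class name CharsetReadDemo and the helper readAll are hypothetical names for this example, and the two sample bytes are simply the UTF-8 encoding of U+00E9. It does not address how the text should be written back out.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CharsetReadDemo {

    /** Reads the whole stream and decodes its bytes to a String using the given charset. */
    public static String readAll(InputStream in, Charset charset) throws IOException {
        StringBuilder text = new StringBuilder();
        try (Reader reader = new InputStreamReader(in, charset)) {
            char[] buffer = new char[4096];
            int n;
            while ((n = reader.read(buffer)) != -1) {
                text.append(buffer, 0, n);
            }
        }
        return text.toString();
    }

    public static void main(String[] args) throws IOException {
        // Two bytes that together form the UTF-8 encoding of U+00E9.
        byte[] utf8Bytes = {(byte) 0xC3, (byte) 0xA9};
        String decoded = readAll(new ByteArrayInputStream(utf8Bytes), StandardCharsets.UTF_8);
        // Print the code point in hex to avoid depending on the console encoding.
        System.out.println(Integer.toHexString(decoded.codePointAt(0)));
    }
}
```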
