How can I convert UTF-8 literals into their UTF-8 characters?
I have a bunch of text files that were encoded in UTF-8. The text inside the files looks like this: \\x6c\\x69b/\\x62\\x2f\\x6d\\x69nd/m\\x61x\\x2e\\x70h\\x70
I've copied all these text files and placed them into a directory /convert/.
I need to read each file and convert the encoded literals into characters, then save the result as filename.converted.txt
What would be the smartest approach to do this? What can I do to convert to the new text? Is there a function for handling Unicode text that converts between the literal and character forms? Should I be using a different programming language for this?
This is what I have at the moment:
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileWriter;
public class decode {
    public static void main(String args[]) {
        File directory = new File("C:/convert/");
        String[] files = directory.list();
        boolean success = false;
        for (String file : files) {
            System.out.println("Processing \"" + file + "\"");
            //TODO read each file and convert them into characters
            success = true;
            if (success) {
                System.out.println("Successfully converted \"" + file + "\"");
            } else {
                System.out.println("Failed to convert \"" + file + "\"");
            }
            //save file
            if (success) {
                try {
                    FileWriter open = new FileWriter("C:/convert/" + file + ".converted.txt");
                    BufferedWriter write = new BufferedWriter(open);
                    write.write("TODO: write converted text into file");
                    write.close();
                    System.out.println("Successfully saved \"" + file + "\" conversion.");
                } catch (Exception e) {
                    e.printStackTrace();
                }
            }
        }
    }
}
(It looks like there's some confusion about what you mean - this answer assumes the input file is entirely in ASCII, and uses "\\x" to hex-encode any bytes which aren't in the ASCII range.)
It sounds to me like the UTF-8 part of it is actually irrelevant. You can treat it as opaque binary data for output. Assuming the input file is entirely ASCII:
- Open the input file as text (e.g. using FileInputStream wrapped in InputStreamReader, specifying an encoding of "US-ASCII")
- Open the output file as binary (e.g. using FileOutputStream)
- Read the input one character at a time. If the character begins a "\\x" escape, convert the two hex digits that follow into a byte and write that byte to the output stream; if not, write the character's ASCII value to the output stream (just a cast from char to byte)
You'll then have a "normal" UTF-8 file which should be readable by any text editor which supports UTF-8.
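The steps above can be sketched in Java roughly as follows. This is only an illustration under stated assumptions: the class name HexEscapeDecoder and its methods are made up for this example, every escape is assumed to be "x" plus exactly two hex digits, and because the sample text shows a doubled backslash, the sketch accepts one or more backslashes before the "x".

```java
import java.io.BufferedOutputStream;
import java.io.BufferedReader;
import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.io.Reader;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of the original question or answer.
public class HexEscapeDecoder {

    /** Copies ASCII text from in to out, replacing each hex escape with the byte it names. */
    public static void decode(Reader in, OutputStream out) throws IOException {
        int c;
        while ((c = in.read()) != -1) {
            if (c > 0x7F) {
                throw new IOException("Input is not pure ASCII");
            }
            if (c != '\\') {
                out.write((byte) c); // ordinary ASCII character: just a cast
                continue;
            }
            int next = in.read();
            while (next == '\\') { // assumption: tolerate a doubled backslash
                next = in.read();
            }
            if (next != 'x') {
                throw new IOException("Backslash not followed by 'x'");
            }
            int high = readHexDigit(in);
            int low = readHexDigit(in);
            out.write((high << 4) | low); // the decoded raw byte
        }
    }

    private static int readHexDigit(Reader in) throws IOException {
        int c = in.read();
        int value = (c == -1) ? -1 : Character.digit((char) c, 16);
        if (value < 0) {
            throw new IOException("Expected a hex digit in escape");
        }
        return value;
    }

    /** Reads input as US-ASCII text and writes the decoded bytes to output. */
    public static void convertFile(File input, File output) throws IOException {
        try (Reader in = new BufferedReader(new InputStreamReader(
                     new FileInputStream(input), StandardCharsets.US_ASCII));
             OutputStream out = new BufferedOutputStream(new FileOutputStream(output))) {
            decode(in, out);
        }
    }

    public static void main(String[] args) throws IOException {
        // The sample from the question, written here with single backslashes.
        String sample = "\\x6c\\x69b/\\x62\\x2f\\x6d\\x69nd/m\\x61x\\x2e\\x70h\\x70";
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        decode(new StringReader(sample), out);
        // prints lib/b/mind/max.php
        System.out.println(new String(out.toByteArray(), StandardCharsets.UTF_8));
    }
}
```

Inside the loop from the question, something like convertFile(new File(directory, file), new File(directory, file + ".converted.txt")) could replace the TODO lines. It would also make sense to skip names that already end in ".converted.txt" so that a second run does not reprocess its own output.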
java.io.InputStreamReader can be used to convert an input stream from an arbitrary charset into Java chars. I'm not exactly sure how you want to write it back out, though. Do you want non-ASCII characters to be written out as ASCII Unicode escape sequences?
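As a rough illustration of that idea, the sketch below decodes a byte stream with InputStreamReader and a chosen charset, then returns the text with every non-ASCII char written as a backslash-u escape. The class and method names (CharsetReadDemo, readAsAsciiEscapes) are invented for this example, and whether escaped output is actually what the asker wants remains an open question.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the original answer.
public class CharsetReadDemo {

    /** Decodes the stream with the given charset; non-ASCII chars become backslash-u escapes. */
    public static String readAsAsciiEscapes(InputStream in, Charset charset) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (Reader reader = new InputStreamReader(in, charset)) {
            int c;
            while ((c = reader.read()) != -1) {
                if (c < 0x80) {
                    sb.append((char) c); // ASCII passes through unchanged
                } else {
                    sb.append(String.format("\\u%04x", c)); // one escape per UTF-16 code unit
                }
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // "caf" followed by the two UTF-8 bytes of e-acute (0xC3 0xA9).
        byte[] utf8 = {'c', 'a', 'f', (byte) 0xC3, (byte) 0xA9};
        System.out.println(readAsAsciiEscapes(new ByteArrayInputStream(utf8), StandardCharsets.UTF_8));
    }
}
```

Passing a different Charset (for example StandardCharsets.ISO_8859_1) changes how the same bytes are interpreted, which is why the charset should always be given explicitly rather than left to the platform default.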