
Java Can't Open a File with Surrogate Unicode Values in the Filename?

I'm dealing with code that does various IO operations with files, and I want to make it able to deal with international filenames. I'm working on a Mac with Java 1.5, and if a filename contains Unicode characters that require surrogates, the JVM can't seem to locate the file. For example, my test file is:

"草𦿶鷗外.gif", which gets broken into the Java characters \u8349\uD85B\uDFF6\u9DD7\u5916.gif

If I create a File from this filename, I can't open it because I get a FileNotFoundException. Even using this on the folder containing the file will fail:

File[] files = folder.listFiles(); 
for (File file : files) {
    if (!file.exists()) {
        System.out.println("Failed to find File"); //Fails on the surrogate filename
    }
}

Most of the code I am actually dealing with is of the form:

FileInputStream instream = new FileInputStream(new File("草𦿶鷗外.gif"));
// operations follow

Is there some way I can address this problem, either escaping the filenames or opening files differently?

I suspect one of Java or Mac is using CESU-8 instead of proper UTF-8. Java uses “modified UTF-8” (which is a slight variation of CESU-8) for a variety of internal purposes, but I wasn't aware it could use it as a filesystem/defaultCharset. Unfortunately I have neither Mac nor Java here to test with.

“Modified” is a modified way of saying “badly bugged”. Instead of outputting a four-byte UTF-8 sequence for supplementary (non-BMP) characters like 𦿶:

\xF0\xA6\xBF\xB6

it outputs a UTF-8-encoded sequence for each of the surrogates:

\xED\xA1\x9B\xED\xBF\xB6

This isn't a valid UTF-8 sequence, but a lot of decoders will allow it anyway. The problem is that if you round-trip it through a real UTF-8 encoder you get a different byte string, the four-byte one above. Try to access the file with that name and boom! Fail.
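For what it's worth, the difference is easy to reproduce from Java itself. The sketch below is only an illustration, not something taken from the original setup: it assumes Java 8 or later, the class and method names are made up, and it uses DataOutputStream.writeUTF (documented to write modified UTF-8 after a two-byte length prefix) as a stand-in for whatever the JVM does internally with filenames.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class EncodingCompare {
    // Standard UTF-8: one four-byte sequence per supplementary code point.
    static byte[] utf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // Modified UTF-8 as produced by DataOutputStream.writeUTF, with the
    // two-byte length prefix removed: each surrogate becomes three bytes.
    static byte[] modifiedUtf8(String s) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeUTF(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        byte[] all = buf.toByteArray();
        return Arrays.copyOfRange(all, 2, all.length);
    }

    // Formats bytes in the same backslash-x style used in this thread.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("\\x%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // U+26FF6 is the supplementary character from the test filename.
        String s = new String(Character.toChars(0x26FF6));
        System.out.println(hex(utf8(s)));
        System.out.println(hex(modifiedUtf8(s)));
    }
}
```

On a stock JVM the first line should match the four-byte sequence and the second the six-byte one quoted above.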

So first let's just check how filenames are actually stored under your current filesystem, using a platform that uses bytes for filenames, such as Python 2.x:

$ python
Python 2.x.something (blah blah)
Type "help", "copyright", "credits" or "license" for more information.
>>> import os
>>> os.listdir('.')

On my filesystem (Linux, ext4, UTF-8), the filename “草𦿶鷗外.gif” comes out as:

['\xe8\x8d\x89\xf0\xa6\xbf\xb6\xe9\xb7\x97\xe5\xa4\x96.gif']

which is what you want. If that's what you get, it's probably Java doing it wrong. If you get the longer six-byte-character version:

['\xe8\x8d\x89\xed\xa1\x9b\xed\xbf\xb6\xe9\xb7\x97\xe5\xa4\x96.gif']

it's probably OS X doing it wrong... does it always store filenames like this? (Or did the files come from somewhere else originally?) What if you rename the file to the 'proper' version?

os.rename('\xe8\x8d\x89\xed\xa1\x9b\xed\xbf\xb6\xe9\xb7\x97\xe5\xa4\x96.gif', '\xe8\x8d\x89\xf0\xa6\xbf\xb6\xe9\xb7\x97\xe5\xa4\x96.gif')
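If you would rather do the check from Java once you have the raw name bytes, here is a rough sketch. It relies on one fact about well-formed UTF-8: the lead byte 0xED may only be followed by 0x80..0x9F, so 0xED followed by 0xA0..0xBF means a surrogate was encoded on its own (CESU-8 style). The class and method names are hypothetical, and the code only inspects byte arrays you hand it; it does not read the filesystem.

```java
final class SurrogateByteCheck {
    // Returns true if the bytes contain a UTF-8-style encoding of a lone
    // surrogate (0xED followed by 0xA0..0xBF), which valid UTF-8 never has.
    static boolean hasEncodedSurrogate(byte[] name) {
        for (int i = 0; i + 1 < name.length; i++) {
            int lead = name[i] & 0xFF;
            int next = name[i + 1] & 0xFF;
            if (lead == 0xED && next >= 0xA0 && next <= 0xBF) {
                return true;
            }
        }
        return false;
    }

    // Convenience helper so byte values can be written as plain ints.
    static byte[] bytes(int... values) {
        byte[] out = new byte[values.length];
        for (int i = 0; i < values.length; i++) {
            out[i] = (byte) values[i];
        }
        return out;
    }

    public static void main(String[] args) {
        // The two listings from this thread, ending in ".gif".
        byte[] proper = bytes(0xE8, 0x8D, 0x89, 0xF0, 0xA6, 0xBF, 0xB6,
                0xE9, 0xB7, 0x97, 0xE5, 0xA4, 0x96, 0x2E, 0x67, 0x69, 0x66);
        byte[] sixByte = bytes(0xE8, 0x8D, 0x89, 0xED, 0xA1, 0x9B, 0xED, 0xBF, 0xB6,
                0xE9, 0xB7, 0x97, 0xE5, 0xA4, 0x96, 0x2E, 0x67, 0x69, 0x66);
        System.out.println("proper:   " + hasEncodedSurrogate(proper));
        System.out.println("six-byte: " + hasEncodedSurrogate(sixByte));
    }
}
```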

If your environment's default locale does not include those characters, you cannot open the file.

See: File.exists() fails with unicode characters in name

Edit: Alright. What you need is to change the system locale, whatever OS you are using.
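Before changing anything, it may help to see what the JVM currently thinks its locale and encodings are. This is just a diagnostic sketch with a made-up class name. Note that sun.jnu.encoding is an implementation-specific property (an assumption that it is present on your JVM; it may print null), and LANG / LC_ALL are only meaningful on Unix-like systems.

```java
import java.nio.charset.Charset;
import java.util.Locale;

final class LocaleReport {
    // Collects the locale and encoding settings the JVM started with.
    static String describe() {
        StringBuilder sb = new StringBuilder();
        sb.append("default locale = ").append(Locale.getDefault()).append('\n');
        sb.append("default charset = ").append(Charset.defaultCharset()).append('\n');
        sb.append("file.encoding = ").append(System.getProperty("file.encoding")).append('\n');
        // Implementation-specific; may be null on some JVMs.
        sb.append("sun.jnu.encoding = ").append(System.getProperty("sun.jnu.encoding")).append('\n');
        // Environment variables that usually drive the locale on Unix-like systems.
        sb.append("LANG = ").append(System.getenv("LANG")).append('\n');
        sb.append("LC_ALL = ").append(System.getenv("LC_ALL")).append('\n');
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.print(describe());
    }
}
```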

Edit:

See: How can I open files containing accents in Java?

See: JFileChooser on Mac cannot see files named by Chinese chars?

This turned out to be a problem with the Mac JVM (tested on 1.5 and 1.6). Filenames containing supplementary characters / surrogate pairs cannot be accessed with the Java File class. I ended up writing a JNI library with Carbon calls for the Mac version of the project (ick). I suspect the CESU-8 issue bobince mentioned, as the JNI call to get UTF-8 characters returned a CESU-8 string. Doesn't look like it's something you can really get around.

It's a bug in the old-skool java.io.File API, maybe just on a Mac? Anyway, the newer java.nio.file API works much better. I have several files containing Unicode characters and content that failed to load using java.io.File and related classes. After converting all my code to use java.nio.file.Path EVERYTHING started working. And I replaced org.apache.commons.io.FileUtils (which has the same problem) with java.nio.file.Files ...

...and be sure to read and write the content of the file using an appropriate charset, for example: Files.readAllLines(myPath, StandardCharsets.UTF_8)
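A minimal sketch of what that conversion can look like, assuming Java 8 or later. The class name, helper names and temp-directory layout are invented for the example, and whether a supplementary-character name works still depends on the JVM, locale and filesystem, so treat it as something to try rather than a guaranteed fix.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;

final class NioRoundTrip {
    // Creates a scratch directory for the demo.
    static Path newTempDir() {
        try {
            return Files.createTempDirectory("nio-demo");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Writes lines to dir/name, finds the file again by listing dir
    // (the step that failed with java.io.File), and reads it back.
    static List<String> writeListRead(Path dir, String name, List<String> lines) {
        try {
            Files.write(dir.resolve(name), lines, StandardCharsets.UTF_8);
            try (DirectoryStream<Path> entries = Files.newDirectoryStream(dir)) {
                for (Path entry : entries) {
                    if (Files.exists(entry)
                            && entry.getFileName().toString().equals(name)) {
                        return Files.readAllLines(entry, StandardCharsets.UTF_8);
                    }
                }
            }
            throw new IOException("directory listing did not contain " + name);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Same characters as the test filename, written as escapes so the
        // source file itself stays ASCII.
        String name = "\u8349\uD85B\uDFF6\u9DD7\u5916.txt";
        System.out.println(writeListRead(newTempDir(), name, Arrays.asList("hello")));
    }
}
```

If the name cannot be represented in the platform's filename encoding, dir.resolve is expected to throw InvalidPathException, which at least fails loudly instead of silently reporting that the file does not exist.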
