How do I use filesystem functions in PHP, using UTF-8 strings?

I can't use mkdir to create folders with UTF-8 characters:

<?php
$dir_name = "Depósito";
mkdir($dir_name);
?>

When I browse this folder in Windows Explorer, the folder name looks like this:

Depósito

What should I do?

I'm using PHP 5.

Just urlencode the string you want to use as a filename. All characters returned from urlencode are valid in filenames (NTFS/HFS/UNIX), and you can urldecode the filenames back to UTF-8 (or whatever encoding they were in).

Caveats (all apply to the solutions below as well):

  • After url-encoding, the filename must be at most 255 characters (probably bytes).
  • Unicode allows multiple representations for many characters (using combining characters), so the same visible name can have different UTF-8 byte sequences. If you don't normalize your UTF-8, you may have trouble searching with glob or reopening an individual file.
  • You can't rely on scandir or similar functions for alpha-sorting. You must urldecode the filenames, then use a sorting algorithm aware of UTF-8 (and collations).
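
A minimal sketch of the urlencode approach, assuming the source file is saved as UTF-8. The helper names name_to_file and file_to_name are made up for illustration, and the 255-byte check reflects the caveat above rather than anything PHP enforces itself.

```php
<?php
// Hypothetical helper: map a UTF-8 name to an ASCII-only filename.
function name_to_file(string $name): string
{
    $encoded = urlencode($name);
    // Fail early if the encoded name exceeds the usual 255-byte limit,
    // instead of letting mkdir()/fopen() fail later.
    if (strlen($encoded) > 255) {
        throw new LengthException("Encoded name is too long: $encoded");
    }
    return $encoded;
}

// Hypothetical helper: recover the original UTF-8 name from a filename.
function file_to_name(string $file): string
{
    return urldecode($file);
}

$file = name_to_file("Depósito");   // "Dep%C3%B3sito"
$name = file_to_name($file);        // "Depósito" again
```

Every non-ASCII byte becomes three ASCII characters, so names in non-Latin scripts reach the length limit quickly.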

Worse Solutions

The following are less attractive solutions, more complicated and with more caveats.

On Windows, the PHP filesystem wrapper expects and returns ISO-8859-1 strings for file/directory names. This gives you two choices:

  1. Use UTF-8 freely in your filenames, but understand that non-ASCII characters will appear incorrect outside PHP. A non-ASCII UTF-8 char will be stored as multiple single ISO-8859-1 characters. E.g. ó will appear as ó in Windows Explorer.

  2. Limit your file/directory names to characters representable in ISO-8859-1. In practice, you'll pass your UTF-8 strings through utf8_decode before using them in filesystem functions, and pass the entries scandir gives you through utf8_encode to get the original filenames in UTF-8.

Caveats galore!

  • If any byte passed to a filesystem function matches an invalid Windows filesystem character in ISO-8859-1, you're out of luck.
  • Windows may use an encoding other than ISO-8859-1 in non-English locales. I'd guess it will usually be one of ISO-8859-#, but this means you'll need to use mb_convert_encoding instead of utf8_decode.
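
As a rough sketch of the second choice with a configurable encoding, the conversion can be wrapped in two small functions. The names to_fs and from_fs and the constant FS_ENCODING are assumptions for illustration, and 'Windows-1252' is only an example value; the mbstring extension is required.

```php
<?php
// Assumption: the target code page is known up front. 'Windows-1252' is
// only an example for a Western European Windows installation.
const FS_ENCODING = 'Windows-1252';

// UTF-8 string -> bytes to hand to mkdir(), fopen(), and friends.
function to_fs(string $utf8, string $fsEncoding = FS_ENCODING): string
{
    return mb_convert_encoding($utf8, $fsEncoding, 'UTF-8');
}

// Bytes returned by scandir(), readdir(), and friends -> UTF-8 string.
function from_fs(string $bytes, string $fsEncoding = FS_ENCODING): string
{
    return mb_convert_encoding($bytes, 'UTF-8', $fsEncoding);
}

$fsName = to_fs("Depósito");   // "Dep\xF3sito": one byte for ó
$back   = from_fs($fsName);    // "Depósito" again
```

Characters that the target encoding cannot represent are replaced by mbstring's substitute character (a question mark by default), so the round trip is only lossless for names that fit the code page.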

This nightmare is why you should probably just transliterate to create filenames.
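
One way to transliterate, sketched with an explicit replacement table so the result does not depend on the platform's iconv or locale. The function name ascii_filename and the (deliberately small) table are illustrative assumptions; extend the table for your own data.

```php
<?php
// Hypothetical helper: reduce a UTF-8 string to a conservative ASCII filename.
function ascii_filename(string $name): string
{
    // Tiny example table: accented vowels plus a couple of common letters.
    $map = [
        'á' => 'a', 'à' => 'a', 'é' => 'e', 'è' => 'e', 'í' => 'i',
        'ì' => 'i', 'ó' => 'o', 'ò' => 'o', 'ú' => 'u', 'ù' => 'u',
        'Á' => 'A', 'À' => 'A', 'É' => 'E', 'È' => 'E', 'Í' => 'I',
        'Ì' => 'I', 'Ó' => 'O', 'Ò' => 'O', 'Ú' => 'U', 'Ù' => 'U',
        'ñ' => 'n', 'Ñ' => 'N', 'ç' => 'c', 'Ç' => 'C',
    ];
    $ascii = strtr($name, $map);
    // Any character still outside [A-Za-z0-9._-] becomes an underscore.
    // preg_replace() returns null on invalid UTF-8; fall back to ''.
    return preg_replace('/[^A-Za-z0-9._-]/u', '_', $ascii) ?? '';
}

$dir_name = ascii_filename("Depósito");   // "Deposito"
```

Transliteration is lossy: two different original names can map to the same filename, so keep the original name elsewhere if you need it back.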

Under Unix and Linux (and possibly under OS X too), the current file system encoding is given by the LC_CTYPE locale parameter (see function setlocale()). For example, it may evaluate to something like en_US.UTF-8, which means the encoding is UTF-8. File names and their paths can then be created with fopen() or retrieved by dir() with this encoding.
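
A small sketch of reading that parameter from PHP. The helper locale_codeset is a made-up name; it only splits the locale string and does not validate the codeset. Recent PHP versions start in the "C" locale, so you may need to call setlocale(LC_CTYPE, "") first to pick up the environment's setting.

```php
<?php
// Hypothetical helper: extract the codeset part of a locale string such as
// "en_US.UTF-8" or "Italian_Italy.1252". Returns null when there is none
// (for example for the plain "C" locale).
function locale_codeset(string $locale): ?string
{
    // The codeset sits between the first dot and an optional "@modifier".
    if (preg_match('/\.([^@]+)/', $locale, $m)) {
        return $m[1];
    }
    return null;
}

// Passing "0" returns the current LC_CTYPE setting without changing it.
$current = setlocale(LC_CTYPE, "0");
$codeset = is_string($current) ? locale_codeset($current) : null;
```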

Under Windows, PHP operates as a "non-Unicode aware program", so file names are converted back and forth between the UTF-16 used by the file system (Windows 2000 and later) and the selected "code page". In the control panel "Regional and Language Options", the tab panel "Formats" sets the code page retrieved by the LC_CTYPE option, while "Administrative -> Language for non-Unicode Programs" sets the translation code page for file names. In western countries the LC_CTYPE parameter evaluates to something like language_country.1252, where 1252 is the code page, also known as "Windows-1252 encoding", which is similar (but not exactly equal) to ISO-8859-1. In Japan the 932 code page is usually set instead, and so on for other countries. Under PHP you may create files whose name can be expressed with the current code page. Vice versa, file names and paths retrieved from the file system are converted from UTF-16 to bytes using the "best-fit" mapping of the current code page.

This mapping is approximate, so some characters might be mangled in an unpredictable way. For example, Caffé Brillì.txt would be returned by dir() as the PHP string Caff\xE9 Brill\xEC.txt, as expected, if the current code page is 1252, while it would be returned as the approximate Caffe Brilli.txt on a Japanese system, because accented vowels are missing from the 932 code page and are replaced with their "best-fit" non-accented vowels. Characters that cannot be translated at all are retrieved as ? (question mark). In general, under Windows there is no safe way to detect such artifacts.

More details are available in my reply to PHP bug no. 47096.
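
Names read back from the file system cannot be checked reliably, but before creating a file you can at least test whether a UTF-8 name is expressible in a given code page. The sketch below is one way to do that with mbstring; fits_code_page is a made-up name, and mbstring's conversion tables are only assumed to match what Windows does for the same code page.

```php
<?php
// Hypothetical helper: true if $utf8 survives a round trip through $codePage.
// mbstring replaces unrepresentable characters with a substitute character,
// so a lossy conversion no longer compares equal to the input.
function fits_code_page(string $utf8, string $codePage): bool
{
    $bytes = mb_convert_encoding($utf8, $codePage, 'UTF-8');
    return mb_convert_encoding($bytes, 'UTF-8', $codePage) === $utf8;
}

$ok  = fits_code_page("Caffé Brillì.txt", 'Windows-1252');   // true
$bad = fits_code_page("日本語.txt", 'Windows-1252');          // false
```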

PHP 7.1 supports UTF-8 filenames on Windows, regardless of the OEM code page.
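
If the same code has to run on both older and newer PHP versions, one possible approach is to convert names only where it is still needed. This is a sketch under that assumption; needs_legacy_conversion is a hypothetical helper, not part of PHP.

```php
<?php
// Hypothetical helper: decide whether a UTF-8 name must be converted to a
// legacy code page before it is passed to filesystem functions.
function needs_legacy_conversion(bool $isWindows, int $phpVersionId): bool
{
    // PHP_VERSION_ID is 70100 for PHP 7.1.0.
    return $isWindows && $phpVersionId < 70100;
}

// PHP_OS starts with "WIN" on Windows (e.g. "WINNT"); "Darwin" contains
// "win" too, hence the check for position 0.
$isWindows = stripos(PHP_OS, 'WIN') === 0;
$convert   = needs_legacy_conversion($isWindows, PHP_VERSION_ID);
```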

The problem is that Windows uses UTF-16 for filesystem strings, whereas Linux and others use different character sets, but often UTF-8. You provided a UTF-8 string, but this is interpreted as another 8-bit character set encoding in Windows, maybe Latin-1, and then the non-ASCII character, which is encoded with 2 bytes in UTF-8, is handled as if it were 2 characters in Windows.
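
The effect can be reproduced in PHP itself. The sketch below reinterprets the UTF-8 bytes as ISO-8859-1 (an assumption; the actual Windows code page may differ, although for these two bytes Windows-1252 gives the same result) and converts them to UTF-8 for display. as_seen_in_latin1 is a made-up name, and the mbstring extension is required.

```php
<?php
// Treat each byte of a UTF-8 string as one ISO-8859-1 character and
// re-encode the result as UTF-8 so it can be printed.
function as_seen_in_latin1(string $utf8): string
{
    return mb_convert_encoding($utf8, 'UTF-8', 'ISO-8859-1');
}

$name = "Dep\u{00F3}sito";             // "Depósito", written with an escape
echo bin2hex("\u{00F3}"), "\n";        // c3b3: two bytes for one character
echo as_seen_in_latin1($name), "\n";   // Depósito
```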

A normal solution is to keep your source code 100% in ASCII, and to have strings somewhere else.
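
Two ways this could look in practice, as a sketch: escapes inside the source, or strings loaded from data kept outside the code. The JSON literal stands in for a hypothetical strings file. Note that this only removes the dependency on the source file's encoding; the resulting string is still UTF-8 and is subject to the Windows conversion described above.

```php
<?php
// ASCII-only source: the accented character is written as an escape.
$fromEscape = "Dep\u{00F3}sito";    // Unicode escape, PHP 7.0 and later
$fromBytes  = "Dep\xC3\xB3sito";    // raw UTF-8 bytes, any PHP version

// Or keep the strings outside the code. A JSON literal stands in for a
// hypothetical strings file here; JSON's own \u escape keeps it ASCII too.
$strings  = json_decode('{"dir_name": "Dep\u00f3sito"}', true);
$fromJson = $strings['dir_name'];
```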

Using the com_dotnet PHP extension, you can access Windows' Scripting.FileSystemObject, and then do everything you want with UTF-8 file/folder names.

I packaged this as a PHP stream wrapper, so it's very easy to use:

https://github.com/nicolas-grekas/Patchwork-UTF8/blob/lab-windows-fs/class/Patchwork/Utf8/WinFsStreamWrapper.php

First verify that the com_dotnet extension is enabled in your php.ini, then enable the wrapper with:

stream_wrapper_register('win', 'Patchwork\Utf8\WinFsStreamWrapper');

Finally, use the functions you're used to (mkdir, fopen, rename, etc.), but prefix your path with win://

For example:

<?php
$dir_name = "Depósito";
mkdir('win://' . $dir_name );
?>

You could use this extension to solve your issue: https://github.com/kenjiuno/php-wfio

$file = fopen("wfio://多国語.txt", "rb"); // in UTF-8
....
fclose($file);

My set of tools for using the filesystem with UTF-8 names on Windows or Linux via PHP, compatible with .htaccess file-exists checks:

// Define CUR_OS once: 'windows' on any Windows build, otherwise the
// lowercased value of PHP_OS. Call this before the functions below.
function define_cur_os(){
    //$cur_os=strtolower(php_uname());
    $cur_os=strtolower(PHP_OS);
    if(substr($cur_os, 0, 3) === 'win'){
        $cur_os='windows';
    }
    define('CUR_OS',$cur_os);
}

// URL-decode the name, then on Windows convert it from UTF-8 to ISO-8859-1,
// transliterating characters that ISO-8859-1 cannot represent.
function filesystem_encode($file_name=''){
    $file_name=urldecode($file_name);
    if(CUR_OS=='windows'){
        $file_name=iconv("UTF-8", "ISO-8859-1//TRANSLIT", $file_name);
    }
    return $file_name;
}

// Create the directory (recursively) if it is missing; return the encoded path.
function custom_mkdir($dir_path='', $chmod=0755){
    $dir_path=filesystem_encode($dir_path);
    if(!is_dir($dir_path)){
        if(!mkdir($dir_path, $chmod, true)){
            //handle mkdir error
        }
    }
    return $dir_path;
}

// Open a file inside $dir_path, creating the directory first.
// $dir_path is concatenated as-is, so it must end with a directory separator.
function custom_fopen($dir_path='', $file_name='', $mode='w'){
    if($dir_path!='' && $file_name!=''){
        $dir_path=custom_mkdir($dir_path);
        $file_name=filesystem_encode($file_name);
        return fopen($dir_path.$file_name, $mode);
    }
    return false;
}

function custom_file_exists($file_path=''){
    $file_path=filesystem_encode($file_path);
    return file_exists($file_path);
}

function custom_file_get_contents($file_path=''){
    $file_path=filesystem_encode($file_path);
    return file_get_contents($file_path);
}

I don't need to write much, it works well:

<?php
$dir_name = mb_convert_encoding("Depósito", "ISO-8859-1", "UTF-8");
mkdir($dir_name);
?>

Try the CodeIgniter Text helper: read about its convert_accented_characters() function, which can be customized.
