How to open a file with a wchar_t* containing a non-ASCII string in Linux?

Environment: GCC/G++, Linux

I have a non-ascii file in file system and I'm going to open it.

Now I have a wchar_t*, but I don't know how to open it (my trusted fopen only accepts a char* file name).

Please help. Thanks a lot.

There are two possible answers:

If you want to make sure all Unicode filenames are representable, you can hard-code the assumption that the filesystem uses UTF-8 filenames. This is the "modern" Linux desktop-app approach. Just convert your strings from wchar_t (UTF-32) to UTF-8 with library functions (iconv would work well) or your own implementation (but look up the specs so you don't get it horribly wrong like Shelwien did), then use fopen.
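The first approach can be sketched with iconv as follows. This is a minimal sketch, not a definitive implementation: it assumes the platform's iconv accepts the encoding name "WCHAR_T" (glibc does), and the helper names `wide_to_utf8` and `fopen_wide_as_utf8` are hypothetical, made up for this illustration.

```c
#include <errno.h>
#include <iconv.h>
#include <stdio.h>
#include <stdlib.h>
#include <wchar.h>

/* Hypothetical helper: convert a wide string to a newly allocated UTF-8
 * string. Returns NULL on failure; the caller frees the result.
 * Assumption: iconv knows the encoding name "WCHAR_T". */
char *wide_to_utf8(const wchar_t *w)
{
    iconv_t cd = iconv_open("UTF-8", "WCHAR_T");
    if (cd == (iconv_t)-1)
        return NULL;

    size_t in_left = wcslen(w) * sizeof(wchar_t);
    /* UTF-8 needs at most 4 bytes per wchar_t, plus the terminator. */
    size_t out_size = wcslen(w) * 4 + 1;
    char *out = malloc(out_size);
    if (out == NULL) {
        iconv_close(cd);
        return NULL;
    }

    char *in_ptr = (char *)w;      /* iconv takes a non-const char ** */
    char *out_ptr = out;
    size_t out_left = out_size - 1;
    size_t rc = iconv(cd, &in_ptr, &in_left, &out_ptr, &out_left);
    iconv_close(cd);
    if (rc == (size_t)-1) {        /* invalid input or conversion error */
        free(out);
        return NULL;
    }
    *out_ptr = '\0';
    return out;
}

/* Hypothetical wrapper: open a file whose name is given as a wide string,
 * hard-coding the assumption that filenames on disk are UTF-8. */
FILE *fopen_wide_as_utf8(const wchar_t *wpath, const char *mode)
{
    char *path = wide_to_utf8(wpath);
    if (path == NULL)
        return NULL;
    FILE *f = fopen(path, mode);
    int saved_errno = errno;       /* keep fopen's errno across free() */
    free(path);
    errno = saved_errno;
    return f;
}
```

A call such as `fopen_wide_as_utf8(L"caf\u00e9.txt", "rb")` would then look for the file whose name is the UTF-8 encoding of that string, regardless of the current locale.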

If you want to do things the more standards-oriented way, you should use wcsrtombs to convert the wchar_t string to a multibyte char string in the locale's encoding (which hopefully is UTF-8 anyway on any modern system) and use fopen. Note that this requires that you previously set the locale with setlocale(LC_CTYPE, "") or setlocale(LC_ALL, "").
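The locale-based approach might look like the sketch below. The helper names `wide_to_locale_mbs` and `fopen_wide_locale` are hypothetical; only wcsrtombs, setlocale and fopen come from the answer. The sketch assumes the program has already called setlocale(LC_CTYPE, "") once at startup.

```c
#include <errno.h>
#include <locale.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: convert a wide string to a newly allocated
 * multibyte string in the current locale's encoding.
 * Returns NULL if a character cannot be represented or on allocation
 * failure; the caller frees the result. */
char *wide_to_locale_mbs(const wchar_t *w)
{
    mbstate_t state;
    const wchar_t *src = w;

    /* First pass: a NULL destination only measures the required length. */
    memset(&state, 0, sizeof state);
    size_t len = wcsrtombs(NULL, &src, 0, &state);
    if (len == (size_t)-1)
        return NULL;

    char *out = malloc(len + 1);
    if (out == NULL)
        return NULL;

    /* Second pass: do the conversion, including the terminating null. */
    src = w;
    memset(&state, 0, sizeof state);
    wcsrtombs(out, &src, len + 1, &state);
    return out;
}

/* Hypothetical wrapper: open a file named by a wide string, using the
 * locale's encoding for the filename bytes. */
FILE *fopen_wide_locale(const wchar_t *wpath, const char *mode)
{
    char *path = wide_to_locale_mbs(wpath);
    if (path == NULL)
        return NULL;
    FILE *f = fopen(path, mode);
    int saved_errno = errno;       /* keep fopen's errno across free() */
    free(path);
    errno = saved_errno;
    return f;
}
```

If setlocale was never called, the program runs in the "C" locale, where the conversion of non-ASCII characters typically fails and the wrapper returns NULL.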

And finally, not exactly an answer but a recommendation:

Storing filenames as wchar_t strings is probably a horrible mistake. You should instead store filenames as abstract byte strings, and only convert those to wchar_t just-in-time for displaying them in the user interface (if it's even necessary for that; many UI toolkits use plain byte strings themselves and do the interpretation as characters for you). This way you eliminate a lot of possible nasty corner cases, and you never encounter a situation where some files are inaccessible due to their names.
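A sketch of that recommendation: the filename stays a plain byte string (for example, exactly as readdir returned it) and is passed to fopen unchanged, while a wide copy is built only for display. The function name `filename_for_display` and the choice of L'?' as a placeholder are assumptions made for this illustration.

```c
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: build a displayable wide string from a filename's
 * raw bytes, interpreted in the current locale's encoding. Bytes that do
 * not decode become L'?', so the label may be lossy, but the original
 * bytes are never modified and remain usable with fopen().
 * Returns the number of wide characters written, excluding the terminator. */
size_t filename_for_display(const char *name, wchar_t *out, size_t out_len)
{
    mbstate_t state;
    size_t remaining = strlen(name);
    size_t n = 0;

    if (out_len == 0)
        return 0;

    memset(&state, 0, sizeof state);
    while (remaining > 0 && n + 1 < out_len) {
        wchar_t wc;
        size_t used = mbrtowc(&wc, name, remaining, &state);
        if (used == (size_t)-1 || used == (size_t)-2) {
            /* Invalid or truncated sequence: emit a placeholder, skip one
             * byte, and restart decoding from a clean state. */
            wc = L'?';
            used = 1;
            memset(&state, 0, sizeof state);
        } else if (used == 0) {
            break;                 /* defensive: embedded null character */
        }
        out[n++] = wc;
        name += used;
        remaining -= used;
    }
    out[n] = L'\0';
    return n;
}
```

Because the byte string is the source of truth, a file whose name is not valid in the current encoding still opens; only its on-screen label is approximate.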

Linux is not UTF-8, but it's your only choice for filenames anyway

(Files can have anything you want inside them.)


With respect to filenames, Linux does not really have a string encoding to worry about. Filenames are byte strings that need to be null-terminated.

This doesn't precisely mean that Linux is UTF-8, but it does mean that it's not compatible with wide characters as they could have a zero in a byte that's not the end byte.

But UTF-8 preserves the no-nulls-except-at-the-end model, so I have to believe that the practical approach is "convert to UTF-8" for filenames.
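The zero-byte problem can be made concrete with a small counting helper. The name `count_zero_bytes` is hypothetical and exists only for this illustration.

```c
#include <stddef.h>

/* Hypothetical helper: count how many bytes in a buffer are zero. */
size_t count_zero_bytes(const void *data, size_t len)
{
    const unsigned char *p = data;
    size_t zeros = 0;

    for (size_t i = 0; i < len; i++) {
        if (p[i] == 0)
            zeros++;
    }
    return zeros;
}
```

For the single wide character L"a", `count_zero_bytes(L"a", sizeof(wchar_t))` reports that all bytes but one are zero, so a name passed to the kernel as raw wide-character bytes would be cut at the first zero byte. The UTF-8 bytes of a non-ASCII name such as "caf\xc3\xa9" contain no zero bytes before the terminator.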

The content of files is a matter for standards above the Linux kernel level, so here there isn't anything Linux-y that you can or want to do. The content of files will be solely the concern of the programs that read and write them. Linux just stores and returns the byte stream, and it can have all the embedded nulls you want.

Convert wchar string to utf8 char string, then use fopen.

typedef unsigned int   uint;
typedef unsigned short word;
typedef unsigned char  byte;

// Assumes wchar_t holds 16-bit code units and handles BMP characters only
// (surrogate pairs are not combined). Returns the number of bytes written.
int UTF16to8( wchar_t* w, char* s ) {
  uint  c;
  word* p = (word*)w;
  byte* q = (byte*)s; byte* q0 = q;
  while( 1 ) {
    c = *p++;
    if( c==0 ) break;
    if( c<0x080 ) *q++ = c; else 
      if( c<0x800 ) *q++ = 0xC0+(c>>6), *q++ = 0x80+(c&63); else 
        *q++ = 0xE0+(c>>12), *q++ = 0x80+((c>>6)&63), *q++ = 0x80+(c&63);
  }
  *q = 0;
  return q-q0;
}

// Inverse conversion under the same assumptions.
// Returns the number of 16-bit units written.
int UTF8to16( char* s, wchar_t* w ) {
  uint  cache=0,wait=0,c;
  byte* p = (byte*)s;
  word* q = (word*)w; word* q0 = q;
  while(1) {
    c = *p++;
    if( c==0 ) break;
    if( c<0x80 ) cache=c,wait=0; else
      if( (c>=0xC0) && (c<0xE0) ) cache=c&31,wait=1; else   // 0xE0 starts a 3-byte sequence
        if( (c>=0xE0) ) cache=c&15,wait=2; else
          if( wait ) (cache<<=6)+=c&63,wait--;
    if( wait==0 ) *q++=cache;
  }
  *q = 0;
  return q-q0;
}
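
For comparison, here is a sketch of a hand-written conversion that treats each wchar_t as a full Unicode code point, which is the case on Linux with glibc where wchar_t is 32 bits wide. The names `utf8_encode` and `utf32_to_utf8` are hypothetical; the byte layout follows the UTF-8 definition (1 to 4 bytes, surrogates and values above 0x10FFFF rejected).

```c
#include <stddef.h>
#include <wchar.h>

/* Hypothetical helper: encode one code point as UTF-8.
 * Returns the number of bytes written (1 to 4), or 0 if cp is not a
 * valid Unicode scalar value (a surrogate, or above 0x10FFFF). */
size_t utf8_encode(unsigned long cp, unsigned char *out)
{
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;              /* surrogates are not characters */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}

/* Hypothetical helper: convert a null-terminated wide string, one code
 * point per wchar_t, into a null-terminated UTF-8 string.
 * Returns the UTF-8 length in bytes, or (size_t)-1 on invalid input or
 * if out_len is too small. */
size_t utf32_to_utf8(const wchar_t *w, char *out, size_t out_len)
{
    size_t n = 0;

    for (; *w != L'\0'; w++) {
        unsigned char tmp[4];
        size_t k = utf8_encode((unsigned long)*w, tmp);
        if (k == 0 || n + k >= out_len)   /* keep room for the terminator */
            return (size_t)-1;
        for (size_t i = 0; i < k; i++)
            out[n++] = (char)tmp[i];
    }
    if (n >= out_len)
        return (size_t)-1;
    out[n] = '\0';
    return n;
}
```

The resulting buffer can be passed straight to fopen. On a platform where wchar_t is 16 bits wide, surrogate pairs would have to be combined before calling `utf8_encode`, which this sketch does not do.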

// locals
string file_to_read;           // path of the file to read
wstring file;                  // receives the file contents, ascii or non-ascii
FILE *stream;
wchar_t buffer = 0;

if( fopen_s( &stream, file_to_read.c_str(), "r+b" ) == 0 )   // in binary mode
  {
      // if ascii file second arg must be sizeof(char). if non ascii file sizeof( wchar_t)
      // loop on fread's result rather than feof(), so the last character is not appended twice
      while( fread( &buffer, sizeof( char ), 1, stream ) == 1 )
      {
        file.append(1, buffer);
      }
      fclose(stream);          // only close a stream that was actually opened
  }

// and the file is in wstring format. You can use it in any C++ wstring operation
// this code is fast enough i think, at least in my practice
// for windows because of fopen_s

Check out this document:

http://www.firstobject.com/wchar_t-string-on-linux-osx-windows.htm

I think Linux follows the POSIX standard, which treats all file names as UTF-8.

I take it it's the name of the file that contains non-ascii characters, not the file itself, when you say "non-ascii file in file system". It doesn't really matter what the file contains.

You can do this with normal fopen, but you'll have to match the encoding the filesystem uses.

It depends on what version of Linux and what filesystem you're using and how you've set it up, but likely, if you're lucky, the filesystem uses UTF-8. So take your wchar_t (which is probably a UTF-16 encoded string?), convert it to a char string encoded in UTF-8, and pass that to fopen.
