
C++ List files with special nonstandard characters

I'm trying to recursively list the files and sub-directories of a given directory. It works so far (using dirent.h), except for files whose names contain special characters such as the en dash or any Japanese or Chinese characters.

Full code here: https://gist.github.com/VikiMaster2/f14a19aa5cf042f0787467a37a616ded

I only get '?'s for files containing odd characters in their names. I understand that such characters cannot be displayed properly in a console and that dirent probably doesn't support non-ASCII characters, but how do I store all the file paths and then put them to use?

[Demo image]

The following is a hexdump of sample output (generated with the simple command ./a.out>abcd.txt):

00000000  20 20 2d 20 61 2e 6f 75  74 0a 20 20 2d 20 61 62  |  - a.out.  - ab|
00000010  63 64 2e 74 78 74 0a 20  20 2d 20 76 69 65 77 73  |cd.txt.  - views|
00000020  6f 75 72 63 65 2e 63 73  73 0a 20 20 2d 20 e0 a4  |ource.css.  - ..|
00000030  b2 e0 a5 87 0a 20 20 2d  20 74 65 73 74 2e 63 0a  |.....  - test.c.|

and the file is:

- a.out
- abcd.txt
- viewsource.css
- ले
- test.c

So, as you can see, the non-ASCII name is stored as multibyte sequences (here e0 a4 b2 and e0 a5 87), and you can figure out the encoding in which it is stored. Once you know that encoding, it is trivial to read it.
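If no hexdump tool is at hand, the same bytes can be inspected from inside the program. The following is a minimal sketch; the helper name `to_hex` is made up for illustration and is not part of the code in the gist.

```cpp
#include <cstddef>
#include <cstdio>
#include <string>

// Hypothetical helper: render every byte of `name` as two lowercase hex
// digits separated by single spaces, similar to one row of a hexdump.
std::string to_hex(const std::string& name) {
    std::string out;
    char buf[4];
    for (std::size_t i = 0; i < name.size(); ++i) {
        // Go through unsigned char so bytes >= 0x80 are not sign-extended.
        const unsigned value = static_cast<unsigned char>(name[i]);
        std::snprintf(buf, sizeof buf, "%02x", value);
        if (i != 0) out += ' ';
        out += buf;
    }
    return out;
}
```

Calling `to_hex` on a name returned by the directory listing and printing the result shows whether the multibyte sequences survived or were already replaced by '?' (0x3f) before they reached your code.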

The simplest way to find out the encoding is to run the file command, like:

$ file abcd.txt
abcd.txt: UTF-8 Unicode text

However, this is just how redirection saved it. You can store it in any encoding you want, with UTF-8 being a particularly good choice. Now all you need to handle is the UTF-8 encoding. There are libraries that will help you with this, but you can always try to do it yourself.
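As a rough idea of what "doing it yourself" could look like, here is a minimal decoder sketch that turns a UTF-8 byte string into Unicode code points. The function name `decode_utf8` is an assumption for illustration. It only checks the basic byte structure and does not reject overlong forms or surrogates, so a real library is still the safer choice for untrusted input.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper: decode UTF-8 bytes into Unicode code points.
// Throws std::runtime_error on a bad lead byte, a bad continuation byte,
// or a sequence that is cut off at the end of the string.
// Simplification: overlong encodings and surrogates are NOT rejected.
std::vector<char32_t> decode_utf8(const std::string& bytes) {
    std::vector<char32_t> out;
    std::size_t i = 0;
    while (i < bytes.size()) {
        const unsigned char lead = static_cast<unsigned char>(bytes[i]);
        char32_t cp = 0;
        std::size_t extra = 0;  // number of continuation bytes expected
        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else throw std::runtime_error("invalid UTF-8 lead byte");

        if (bytes.size() - i <= extra)
            throw std::runtime_error("truncated UTF-8 sequence");

        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(bytes[i + k]);
            if ((c & 0xC0) != 0x80)
                throw std::runtime_error("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (c & 0x3F);  // append the 6 payload bits
        }
        out.push_back(cp);
        i += extra + 1;
    }
    return out;
}
```

Applied to the six bytes e0 a4 b2 e0 a5 87 from the hexdump, this yields the two code points U+0932 and U+0947, which together render as ले.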

EDIT 1: I am sorry, I did not notice that you are on Windows, and I used Linux for the file command. I do not know if Windows has a file command. But you can detect the presence of UTF-8 characters yourself in your code. It is very simple to code, and I think you will be able to do it.
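One possible way to do that detection is sketched below. Both function names are invented for this example. The first checks whether a name contains any byte outside ASCII; the second checks whether the bytes follow the UTF-8 lead/continuation pattern. It is a structural check only and does not reject overlong forms or surrogates, and short strings in other encodings can occasionally pass it by coincidence.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: true if any byte has its high bit set,
// i.e. the string is not pure ASCII.
bool has_non_ascii(const std::string& s) {
    for (unsigned char c : s) {
        if (c >= 0x80) return true;
    }
    return false;
}

// Hypothetical helper: true if every lead byte is followed by the number
// of continuation bytes (10xxxxxx) that UTF-8 requires for it.
bool is_well_formed_utf8(const std::string& s) {
    std::size_t i = 0;
    while (i < s.size()) {
        const unsigned char lead = static_cast<unsigned char>(s[i]);
        std::size_t extra = 0;  // continuation bytes expected after `lead`
        if (lead < 0x80)                extra = 0;
        else if ((lead & 0xE0) == 0xC0) extra = 1;
        else if ((lead & 0xF0) == 0xE0) extra = 2;
        else if ((lead & 0xF8) == 0xF0) extra = 3;
        else return false;              // stray continuation or invalid byte

        if (s.size() - i <= extra) return false;  // sequence is cut off
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(s[i + k]);
            if ((c & 0xC0) != 0x80) return false;
        }
        i += extra + 1;
    }
    return true;
}
```

A name for which `has_non_ascii` is true and `is_well_formed_utf8` is also true is very likely UTF-8; if the first is true and the second is false, the bytes are probably in some other encoding.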
