
How to know which special character is there in a file?

My app needs to process text files during a batch process. Occasionally I receive a file with some special character at the end of the file. I am not sure what that special character is. Is there any way I can find out what that character is, so that I can tell the other team that produces the file?

I have used Mozilla's library to guess the file encoding and it says UTF-8.

First, whether the character is really "special" depends on what you call a "special character". As a side note, on Unix and OS X you can use, for example, the od, file and hexdump commands to examine files easily:

... $  hexdump -C example.txt 
00000530  6f 77 73 20 61 63 74 69  6f 6e 2e 0a 0a 0a 0a     |ows action.....|
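Where hexdump is not available, a few lines of Python can show the same information. The following is a minimal sketch, not part of the original answer; the function name hex_tail is made up for illustration, and the output layout is only loosely modeled on hexdump -C (no offset column).

```python
def hex_tail(data: bytes, count: int = 16) -> str:
    """Return the last `count` bytes of `data` as hex plus a printable-ASCII column."""
    tail = data[-count:] if count > 0 else b""
    # Two hex digits per byte, separated by single spaces.
    hex_part = " ".join(f"{b:02x}" for b in tail)
    # Non-printable bytes are shown as '.', as hexdump -C does.
    text_part = "".join(chr(b) if 32 <= b < 127 else "." for b in tail)
    return f"{hex_part}  |{text_part}|"


if __name__ == "__main__":
    # Same trailing bytes as in the hexdump output above.
    sample = b"ows action.\n\n\n\n"
    print(hex_tail(sample))
```

For a real file, the bytes could be obtained with open(path, "rb").read() (or by seeking near the end for large files) and passed to the same function.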

Now, if you know your file is encoded as UTF-8, every byte whose highest bit is zero corresponds to exactly one character (in the example above, the last byte is '0a', so that '0a' byte is one "character", here a line feed).

A file in UTF-8 also means that every byte whose highest bit is set to 1 is part of a multi-byte character. For example, in the following byte sequence:

75 20 5b e2 80 a6 5d 20  61 75 74 6f 72 69 73 61

the only three bytes that have their highest bit set are "e2 80 a6" (all the values from 0x80 to 0xFF have their leftmost/highest bit set) and they're part of the same character. A non-ASCII character in UTF-8 is never made of a single byte whose highest bit is set, so three such bytes in a row cannot be split into smaller characters; in addition, the lead byte e2 (binary 1110xxxx) announces a three-byte sequence. The fact that every UTF-8 byte whose leftmost/highest bit is set is part of a multi-byte character is IMHO a truly beautiful feature of UTF-8.
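The high-bit rule can be turned into a small routine that groups bytes into characters. This is a sketch under the assumption that the input is valid UTF-8; the name split_utf8 is hypothetical, and the sequence length is read from the lead byte (0xxxxxxx, 110xxxxx, 1110xxxx, 11110xxx), which is standard UTF-8 rather than something stated in the answer above.

```python
def split_utf8(data: bytes) -> list:
    """Split (assumed valid) UTF-8 bytes into one bytes object per character."""
    chars = []
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:                # 0xxxxxxx: highest bit clear, one-byte character
            length = 1
        elif b >> 5 == 0b110:       # 110xxxxx: lead byte of a 2-byte character
            length = 2
        elif b >> 4 == 0b1110:      # 1110xxxx: lead byte of a 3-byte character
            length = 3
        elif b >> 3 == 0b11110:     # 11110xxx: lead byte of a 4-byte character
            length = 4
        else:                       # 10xxxxxx here means a stray continuation byte
            raise ValueError(f"unexpected byte 0x{b:02x} at offset {i}")
        chars.append(data[i:i + length])
        i += length
    return chars


if __name__ == "__main__":
    # The 16 bytes from the example sequence above.
    sample = bytes.fromhex("75205be280a65d206175746f72697361")
    for piece in split_utf8(sample):
        if len(piece) > 1:
            print("multi-byte character:", piece.hex())
```

The sketch does not check that the bytes after a lead byte really are continuation bytes; decoding with bytes.decode("utf-8") would catch that.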

Now you Google "e2 80 a6" and you see that it's the Unicode character named "horizontal ellipsis" (code point U+2026, whose UTF-8 encoding is the hexadecimal bytes e2 80 a6).
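The lookup can also be done offline. As a sketch (the helper name describe is made up here), Python's standard unicodedata module maps a decoded character to its Unicode name:

```python
import unicodedata


def describe(raw: bytes) -> str:
    """Decode the UTF-8 bytes of a single character and return its code point and name."""
    char = raw.decode("utf-8")
    # Control characters such as line feed have no name, hence the fallback.
    return f"U+{ord(char):04X} {unicodedata.name(char, '<no name>')}"


if __name__ == "__main__":
    print(describe(bytes.fromhex("e280a6")))  # prints "U+2026 HORIZONTAL ELLIPSIS"
```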

So basically you have to do two things:

  • find which bytes make up that last "special" character (is it just one byte or several bytes?)

  • find which "special character" this byte (or these bytes) corresponds to
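Both steps can be combined in one short script. This is only a sketch, assuming the file really is valid UTF-8 as the encoding detector reported; last_char_of_file is a hypothetical name, and it reads just the final four bytes because a UTF-8 character is at most four bytes long.

```python
import os
import unicodedata


def last_char_of_file(path):
    """Return (hex bytes, code point, Unicode name) for the last character of a UTF-8 file."""
    size = os.path.getsize(path)
    with open(path, "rb") as f:
        f.seek(max(0, size - 4))   # a UTF-8 character is at most 4 bytes long
        tail = f.read()
    if not tail:
        return None                # empty file: no last character
    # Step 1: walk back over continuation bytes (10xxxxxx) to the start of the character.
    i = len(tail) - 1
    while i > 0 and (tail[i] & 0xC0) == 0x80:
        i -= 1
    raw = tail[i:]
    # Step 2: decode those bytes and look up the character.
    # Raises UnicodeDecodeError if the tail is not valid UTF-8.
    char = raw.decode("utf-8")
    return raw.hex(" "), f"U+{ord(char):04X}", unicodedata.name(char, "<no name>")


if __name__ == "__main__":
    import sys
    print(last_char_of_file(sys.argv[1]))
```

Run against the problem file, the tuple it prints is something that can be passed on to the team producing the file. If decoding fails, that in itself suggests the trailing bytes are not UTF-8 at all.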

Any hex editor ought to allow you to see each individual byte in a file. This ought to allow you to tell them what character it is.

Here's one I've used in the past: http://www.hexworkshop.com/

On Unix, you can use the od utility to output several representations of the byte data in a file or stream.
