
How can I find out which special character is in a file?

My app needs to process text files during a batch process. Occasionally I receive a file with some special character at the end of the file. I am not sure what that special character is. Is there any way I can find out what that character is, so that I can tell the other team that is producing the file?

I have used Mozilla's library to guess the file encoding and it says UTF-8.

First, whether the character is really "special" depends on what you call a "special character". As a side note, on Unix and OS X you can use, for example, the od, file and hexdump commands to easily examine files:

... $  hexdump -C example.txt 
00000530  6f 77 73 20 61 63 74 69  6f 6e 2e 0a 0a 0a 0a     |ows action.....|

Now, if you know the file is encoded as UTF-8, every byte whose highest bit is zero corresponds to exactly one character (in the example above the last byte is '0a', so that single byte is one "character", the line feed).

A file in UTF-8 also means that every byte whose highest bit is set to 1 is part of a multi-byte character. For example, in the following byte sequence:

75 20 5b e2 80 a6 5d 20  61 75 74 6f 72 69 73 61

the only three bytes that have their highest bit set are "e2 80 a6" (every value from 0x80 to 0xFF has its leftmost/highest bit set), and they are part of the same character: UTF-8 cannot encode a non-ASCII character as a single byte with its highest bit set, and e2 is a lead byte that announces a three-byte sequence, so these three bytes belong together. The fact that every UTF-8 byte whose leftmost/highest bit is set is part of a multi-byte character is IMHO a truly beautiful feature of UTF-8.
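To make the high-bit rule concrete, here is a minimal Python sketch that groups the byte sequence above into per-character chunks. The helper name `split_utf8` is made up for this illustration, and it assumes the input is already valid UTF-8 (it does no validation).

```python
def split_utf8(data: bytes) -> list[bytes]:
    """Group raw bytes into per-character chunks using only the high bits."""
    chunks = []
    for byte in data:
        # 10xxxxxx marks a continuation byte: it belongs to the previous chunk.
        if byte & 0xC0 == 0x80 and chunks:
            chunks[-1] += bytes([byte])
        else:
            # 0xxxxxxx (ASCII) or 11xxxxxx (lead byte) starts a new character.
            chunks.append(bytes([byte]))
    return chunks


sample = bytes.fromhex("75 20 5b e2 80 a6 5d 20 61 75 74 6f 72 69 73 61")
for chunk in split_utf8(sample):
    kind = "multi-byte" if chunk[0] & 0x80 else "ASCII"
    print(chunk.hex(" "), kind)
```

Every chunk comes out as a single ASCII byte except "e2 80 a6", which is reported as one multi-byte character.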

Now search for "e2 80 a6" and you will find that it is the Unicode character named "horizontal ellipsis" (code point U+2026, which UTF-8 represents as the hexadecimal bytes e2 80 a6).
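If Python is at hand, the lookup can also be done locally instead of with a search engine. This is only a sketch using the standard `unicodedata` module; the variable names are arbitrary.

```python
import unicodedata

raw = bytes.fromhex("e2 80 a6")   # the three bytes found in the hex dump
char = raw.decode("utf-8")        # decodes to a single character

# Prints the code point and the official Unicode name.
print(f"U+{ord(char):04X}", unicodedata.name(char))
```

For these bytes it prints `U+2026 HORIZONTAL ELLIPSIS`.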

So basically you have to do two things:

  • find which bytes make up that last "special" character (is it just one byte or several bytes?)

  • find which "special character" this byte (or these bytes) corresponds to
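The two steps above could be automated roughly as follows. This is a sketch, not a definitive tool: the function name `last_character` is hypothetical, and it assumes the file really is valid UTF-8 (otherwise `decode` raises `UnicodeDecodeError`).

```python
import os
import unicodedata


def last_character(path: str) -> tuple[bytes, str, str]:
    """Return (raw bytes, code point, Unicode name) of the last character in a UTF-8 file."""
    with open(path, "rb") as f:
        f.seek(0, os.SEEK_END)
        size = f.tell()
        if size == 0:
            raise ValueError("file is empty")
        # A UTF-8 character is at most 4 bytes long, so the tail is enough.
        f.seek(max(0, size - 4))
        tail = f.read()

    # Step 1: walk back over continuation bytes (10xxxxxx) to the start byte.
    start = len(tail) - 1
    while start > 0 and tail[start] & 0xC0 == 0x80:
        start -= 1
    raw = tail[start:]

    # Step 2: decode and look up the name. Control characters such as the
    # line feed have no Unicode name, hence the fallback string.
    char = raw.decode("utf-8")
    name = unicodedata.name(char, "<unnamed control character>")
    return raw, f"U+{ord(char):04X}", name
```

Calling `last_character("example.txt")` on the file from the hex dump above would report the raw byte `0a` and the code point U+000A, i.e. the trailing line feed.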

Any hex editor ought to allow you to see each individual byte in a file. This ought to allow you to tell them what character it is.

Here's one I've used in the past: http://www.hexworkshop.com/

On Unix, you can use the od utility to print several representations of the byte data in a file or stream.
