
linux + verify if file is text or binary

How can I verify whether a file is binary or text without opening it?

Schrödinger's cat, I'm afraid.

There is no way to determine the contents of a file without opening it. The filesystem stores no metadata relating to the contents.
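A small sketch of that point, assuming a POSIX shell with `mktemp` and `awk` available; the file names and the `meta` helper are made up for illustration. Two files of the same size, one plain text and one containing control bytes, look identical to anything that reads only the metadata:

```shell
# Hypothetical demo: metadata alone cannot tell text from binary.
dir=$(mktemp -d)
printf 'abcd' > "$dir/sample_text"          # 4 printable bytes
printf 'a\000b\001' > "$dir/sample_binary"  # 4 bytes, including NUL and SOH

# meta FILE: print only the mode string and the size from a long listing.
# ls obtains these from the inode; the file contents are never read.
meta() { ls -ldn "$1" | awk '{ print $1, $5 }'; }

meta "$dir/sample_text"
meta "$dir/sample_binary"
```

Both calls print the same mode and size, which is why any content check has to read at least part of the file.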

If not opening the file is not a hard requirement, then there are a number of solutions available to you.

Edit:

It has been suggested in a number of comments and answers that file(1) is a good way of determining the contents. Indeed it is. However, file(1) opens the file, which was prohibited in the question. See the penultimate line in the following example:

> echo 'This is not a pipe' > file.jpg && strace file file.jpg 2>&1 | grep file.jpg
execve("/usr/bin/file", ["file", "file.jpg"], [/* 56 vars */]) = 0
lstat64("file.jpg", {st_mode=S_IFREG|0644, st_size=19, ...}) = 0
stat64("file.jpg", {st_mode=S_IFREG|0644, st_size=19, ...}) = 0
open("file.jpg", O_RDONLY|O_LARGEFILE)  = 3
write(1, "file.jpg: ASCII text\n", 21file.jpg: ASCII text

The correct way to determine the type of a file is to use the file(1) command.

You also need to be aware that UTF-8 encoded files are "text" files, but may contain non-ASCII data. Other encodings also have this issue. In the case of text encoded with a code page, it may not be possible to unambiguously determine if a file is text or not.

The file(1) command will look at the structure of a file to try and determine what it contains. From the file(1) man page:

The type printed will usually contain one of the words text (the file contains only printing characters and a few common control characters and is probably safe to read on an ASCII terminal), executable (the file contains the result of compiling a program in a form understandable to some UNIX kernel or another), or data meaning anything else (data is usually 'binary' or non-printable).

With regard to different character encodings, the file(1) man page has this to say:

If a file does not match any of the entries in the magic file, it is examined to see if it seems to be a text file. ASCII, ISO-8859-x, non-ISO 8-bit extended-ASCII character sets (such as those used on Macintosh and IBM PC systems), UTF-8-encoded Unicode, UTF-16-encoded Unicode, and EBCDIC character sets can be distinguished by the different ranges and sequences of bytes that constitute printable text in each set. If a file passes any of these tests, its character set is reported. ASCII, ISO-8859-x, UTF-8, and extended-ASCII files are identified as 'text' because they will be mostly readable on nearly any terminal; UTF-16 and EBCDIC are only 'character data' because, while they contain text, it is text that will require translation before it can be read.

So, some text will be identified as text, but some may be identified as character data. You will need to determine for yourself whether this matters to your application and take appropriate action.
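One possible way to make that decision explicit is sketched below. It assumes a file(1) build that supports the `--mime-encoding` option; the helper names `text_encoding` and `treat_as_text` are hypothetical, and the policy (anything whose charset is not reported as `binary` counts as text, UTF-16 included) is only one choice an application might make:

```shell
# text_encoding FILE: print the charset that file(1) guesses for FILE,
# or "unknown" when file(1) is not installed.
text_encoding() {
    if command -v file >/dev/null 2>&1; then
        file -b --mime-encoding -- "$1"
    else
        echo unknown
    fi
}

# treat_as_text FILE: succeed unless the charset is "binary" or could not
# be determined. This is an assumed policy, not something file(1) dictates.
treat_as_text() {
    case "$(text_encoding "$1")" in
        binary|unknown) return 1 ;;
        *) return 0 ;;
    esac
}

# Demo on a throwaway file.
demo=$(mktemp)
printf 'hello world\n' > "$demo"
echo "charset: $(text_encoding "$demo")"
```

An application that cannot handle UTF-16 or EBCDIC input would add those charset names to the rejecting branch of the `case`.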

There is no way of being certain without looking inside the file. However, you don't have to open it in an editor and look for yourself to get a clue. You may want to look into the file command: http://linux.die.net/man/1/file

If you are attempting to do this from a command shell, then the file command will take a guess at what file type it is. If it is text, then the description will generally include the word text.
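A minimal sketch of that check, assuming file(1) is installed; `looks_like_text` is a hypothetical helper name. It reuses the misleadingly named file.jpg from the strace example to show that the guess is based on content rather than on the extension:

```shell
# looks_like_text FILE: succeed when the description printed by file(1)
# contains the word "text". Fails if file(1) is missing or reports
# anything else.
looks_like_text() {
    case "$(file -b -- "$1" 2>/dev/null)" in
        *text*) return 0 ;;
        *) return 1 ;;
    esac
}

dir=$(mktemp -d)
echo 'This is not a pipe' > "$dir/file.jpg"
if looks_like_text "$dir/file.jpg"; then
    echo "text"
else
    echo "not text (or file(1) unavailable)"
fi
```

Because the result is still a guess, a caller should treat a failure as "not known to be text" rather than as proof that the file is binary.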

I am not aware of any method that is 100% reliable, but the file command is probably the most accurate.

In Unix, a file is just some bytes. So, without opening the file, you cannot determine with 100% certainty whether it is ASCII or binary.

You can use the tools available to you and dig deeper to make the check foolproof.

  1. file
  2. cat -v
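As a rough illustration of the second tool, `cat -v` shows control bytes in caret notation so they can be spotted by eye. The `has_nul` helper below is a hypothetical addition and not part of the answer: it applies the common (and fallible) heuristic that a file containing a NUL byte is probably not plain text, using only `tr` and `wc`:

```shell
# cat -v makes control bytes visible: NUL is shown as ^@ and 0x01 as ^A.
printf 'a\000b\001\n' | cat -v    # prints: a^@b^A

# has_nul FILE: succeed when FILE contains at least one NUL byte.
# Compares the byte count with and without NUL bytes; reads the whole file.
has_nul() {
    total=$(wc -c < "$1")
    without=$(tr -d '\000' < "$1" | wc -c)
    [ "$((total))" -ne "$((without))" ]
}

dir=$(mktemp -d)
printf 'abcd' > "$dir/plain"
printf 'a\000b\001' > "$dir/mixed"
has_nul "$dir/mixed" && echo "mixed: contains NUL, probably binary"
has_nul "$dir/plain" || echo "plain: no NUL bytes found"
```

Note that UTF-16 text normally contains NUL bytes, so this heuristic would flag it as binary, which ties back to the encoding caveats above.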

