
How can I determine if a file is binary or text?

I'm writing an application where I need to determine whether the files provided by the user are text or not, because I'm performing a search within them.

I'm not relying on the extension, because I also want to search in source code files, for example, or in any other file that has textual content (even with a little-known extension).

Is there a way to determine if a file is text or not?

Thanks everyone for the solutions provided! I just found a framework that seems to do the job quite well!

I'll leave a link here for reference: https://github.com/aidansteele/MagicKit

There is no way to be certain. But note that most control characters do not appear in an ASCII text file. You can make a pretty good guess by building a set containing most of the ASCII control characters (leaving out the few that do occur in text, such as tab and newline) and then counting the bytes in the file that fall in that set; the count should be zero for an ASCII text file. In the final analysis, though, you are trying to prove a negative, which is a troublesome thing to do.
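The counting idea above could be sketched in Python roughly as follows. This is only an illustration under assumptions: the allow-list (tab, line feed, carriage return) and the function names are hypothetical choices, not something the answer specifies.

```python
# Assumption: these control bytes are treated as normal in text files.
ALLOWED_CONTROLS = {0x09, 0x0A, 0x0D}  # tab, line feed, carriage return

# Every other ASCII control byte (0x00-0x1F and DEL) counts as suspicious.
SUSPICIOUS = frozenset(
    b for b in list(range(0x20)) + [0x7F] if b not in ALLOWED_CONTROLS
)


def count_suspicious_bytes(data: bytes) -> int:
    """Count bytes in data that fall in the suspicious control set."""
    return sum(1 for b in data if b in SUSPICIOUS)


def looks_like_ascii_text(data: bytes) -> bool:
    """Guess: ASCII text has no suspicious control bytes and no byte >= 0x80."""
    return count_suspicious_bytes(data) == 0 and all(b < 0x80 for b in data)
```

In practice you would probably read only the first few kilobytes of the file (for example `open(path, "rb").read(4096)`) and pass that to `looks_like_ascii_text`. The result is still a guess, for the reason given above.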

You would need to open the file and read the data.

For ASCII text files, this means checking that the characters are in the printable range.

For UTF text files, you may need to read the BOM (Byte Order Mark) first to determine the encoding before reading the rest of the file.

Read more here: http://en.wikipedia.org/wiki/Text_file
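As a rough sketch of the two checks described in this answer, the Python below first looks for a BOM and, if none is found, falls back to a printable-ASCII test. The function names and the decision to accept tab, LF and CR as "printable" are assumptions made for illustration. Note that many UTF-8 files carry no BOM, so this will report such files as non-text if they contain non-ASCII characters.

```python
import codecs

# Check the 4-byte UTF-32 BOMs before UTF-16: the UTF-32 LE BOM
# starts with the same two bytes as the UTF-16 LE BOM.
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def detect_bom(data: bytes):
    """Return (encoding name, BOM length), or (None, 0) if no BOM is present."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0


def is_printable_ascii(data: bytes) -> bool:
    """True if every byte is printable ASCII or tab/LF/CR (assumed allow-list)."""
    return all(0x20 <= b <= 0x7E or b in (0x09, 0x0A, 0x0D) for b in data)


def looks_like_text(data: bytes) -> bool:
    """Guess whether data is text: decode per the BOM, else test for ASCII."""
    encoding, skip = detect_bom(data)
    if encoding is None:
        return is_printable_ascii(data)
    try:
        data[skip:].decode(encoding)
    except UnicodeDecodeError:
        return False
    return True
```

A caller would typically pass the whole file contents, or a prefix of it cut at a safe boundary, since truncating in the middle of a multi-byte character can make the decode step fail.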

