How can I detect if a file is binary (non-text) in Python?

How can I tell if a file is binary (non-text) in Python?

I am searching through a large set of files in Python, and keep getting matches in binary files. This makes the output look incredibly messy.

I know I could use grep -I, but I am doing more with the data than grep allows for.

In the past, I would have just searched for characters greater than 0x7f, but UTF-8 and the like make that impossible on modern systems. Ideally, the solution would be fast.

Yet another method, based on file(1) behavior:

>>> textchars = bytearray({7,8,9,10,12,13,27} | set(range(0x20, 0x100)) - {0x7f})
>>> is_binary_string = lambda bytes: bool(bytes.translate(None, textchars))

Example:

>>> is_binary_string(open('/usr/bin/python', 'rb').read(1024))
True
>>> is_binary_string(open('/usr/bin/dh_python3', 'rb').read(1024))
False
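The two-liner above can be wrapped into a function that takes a path. This is only a minimal sketch: the names `TEXT_CHARS` and `looks_binary` and the 1024-byte default sample size are assumptions made for illustration, not anything defined by file(1).

```python
# Bytes treated as "text": a few common control characters plus
# everything from 0x20 upward except DEL (0x7f).
TEXT_CHARS = bytes({7, 8, 9, 10, 12, 13, 27} | set(range(0x20, 0x100)) - {0x7f})


def looks_binary(path, sample_size=1024):
    """Return True if the first sample_size bytes contain non-text bytes."""
    with open(path, 'rb') as f:
        sample = f.read(sample_size)
    # translate(None, delete) strips every text byte; anything left is non-text.
    return bool(sample.translate(None, TEXT_CHARS))
```

Because every byte from 0x80 upward counts as text here, UTF-8 and Latin-1 files pass, but so can some binary formats that happen to avoid low control bytes in the sampled prefix.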

You can also use the mimetypes module:

import mimetypes
...
mime = mimetypes.guess_type(file)

It's fairly easy to compile a list of binary MIME types. For example, Apache is distributed with a mime.types file that you could parse into two lists, binary and text, and then check whether the MIME type is in your text list or your binary list.
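As a rough illustration of the idea, the sketch below treats any `text/*` type as text instead of parsing a mime.types file. The function name `guess_is_text` is hypothetical, and the decision to report compressed files as non-text is an assumption of this sketch.

```python
import mimetypes


def guess_is_text(filename):
    """Guess from the file name alone; return None when the type is unknown."""
    mime, encoding = mimetypes.guess_type(filename)
    if encoding is not None:
        return False  # e.g. gzip-compressed data is not readable as text
    if mime is None:
        return None  # unknown extension: no verdict
    return mime.startswith('text/')
```

Note that this looks only at the name, never at the contents, and that some textual formats are registered under `application/` rather than `text/`, so a hand-maintained list of extra text types may still be needed.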

If you're using Python 3 with UTF-8, it is straightforward: just open the file in text mode and stop processing if you get a UnicodeDecodeError. Python 3 uses Unicode when handling files in text mode (and bytes in binary mode). If your encoding can't decode arbitrary files, it's quite likely that you will get a UnicodeDecodeError.

Example:

try:
    with open(filename, "r") as f:
        for l in f:
            process_line(l)
except UnicodeDecodeError:
    pass # Found non-text data

Try this:

def is_binary(filename):
    """Return true if the given filename is binary.
    @raise EnvironmentError: if the file does not exist or cannot be accessed.
    @attention: found @ http://bytes.com/topic/python/answers/21222-determine-file-type-binary-text on 6/08/2010
    @author: Trent Mick <TrentM@ActiveState.com>
    @author: Jorge Orpinel <jorge@orpinel.com>"""
    fin = open(filename, 'rb')
    try:
        CHUNKSIZE = 1024
        while 1:
            chunk = fin.read(CHUNKSIZE)
            if b'\0' in chunk: # found null byte (bytes literal, since the file is opened in 'rb' mode)
                return True
            if len(chunk) < CHUNKSIZE:
                break # done
    # Look, Python doesn't need the "except:" clause here. How clever.
    finally:
        fin.close()

    return False

If it helps, many binary file types begin with a magic number. Here is a list of file signatures.
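A minimal sketch of signature sniffing follows. The table holds only a handful of widely documented signatures and is nowhere near complete; `SIGNATURES` and `sniff_signature` are names invented for this example.

```python
# A few well-known file signatures (magic numbers); far from exhaustive.
SIGNATURES = {
    b'\x89PNG\r\n\x1a\n': 'PNG image',
    b'%PDF-': 'PDF document',
    b'PK\x03\x04': 'ZIP archive',
    b'GIF87a': 'GIF image',
    b'GIF89a': 'GIF image',
    b'\x7fELF': 'ELF executable',
    b'\x1f\x8b': 'gzip data',
}


def sniff_signature(path):
    """Return a label if the file starts with a known signature, else None."""
    longest = max(len(sig) for sig in SIGNATURES)
    with open(path, 'rb') as f:
        head = f.read(longest)
    for sig, label in SIGNATURES.items():
        if head.startswith(sig):
            return label
    return None
```

A `None` result says nothing either way: plenty of binary formats have no signature in this table, so this works best as a quick first filter before a content-based heuristic.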

Use the binaryornot library (GitHub).

It is very simple and based on the code found in this Stack Overflow question.

You can actually write this in 2 lines of code; however, this package saves you from having to write and thoroughly test those 2 lines with all sorts of weird file types, cross-platform.

We can use Python itself to check whether a file is binary, because reading a binary file in text mode usually fails with a decoding error:

def is_binary(file_name):
    try:
        with open(file_name, 'tr') as check_file:  # try open file in text mode
            check_file.read()
            return False
    except UnicodeDecodeError:  # decoding failed, so the file is non-text (binary)
        return True
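The function above reads the whole file and depends on the platform's default encoding. A variant that checks only a prefix against an explicit encoding could look like the sketch below; the name `is_utf8_text` and the 8192-byte sample size are assumptions, and UTF-8 is assumed to be the only text encoding of interest.

```python
import codecs


def is_utf8_text(path, sample_size=8192):
    """Return True if the first sample_size bytes decode as UTF-8."""
    with open(path, 'rb') as f:
        sample = f.read(sample_size)
    decoder = codecs.getincrementaldecoder('utf-8')()
    try:
        # final=False tolerates a multi-byte character cut off at the end
        # of the sample instead of reporting it as an error.
        decoder.decode(sample, final=False)
    except UnicodeDecodeError:
        return False
    return True
```

Keep in mind that NUL bytes are valid UTF-8, so a file full of zero bytes still passes this check; combining it with a NUL-byte test may be worthwhile.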

Here's a suggestion that uses the Unix file command:

import re
import subprocess

def istext(path):
    # The pipe yields bytes on Python 3, so match with a bytes pattern
    return (re.search(br':.* text',
                      subprocess.Popen(["file", '-L', path],
                                       stdout=subprocess.PIPE).stdout.read())
            is not None)

Example usage:

>>> istext('/etc/motd') 
True
>>> istext('/vmlinuz') 
False
>>> open('/tmp/japanese').read()
'\xe3\x81\x93\xe3\x82\x8c\xe3\x81\xaf\xe3\x80\x81\xe3\x81\xbf\xe3\x81\x9a\xe3\x81\x8c\xe3\x82\x81\xe5\xba\xa7\xe3\x81\xae\xe6\x99\x82\xe4\xbb\xa3\xe3\x81\xae\xe5\xb9\x95\xe9\x96\x8b\xe3\x81\x91\xe3\x80\x82\n'
>>> istext('/tmp/japanese') # works on UTF-8
True

It has the downsides of not being portable to Windows (unless you have something like the file command there), and of having to spawn an external process for each file, which might not be palatable.

A shorter solution, with a UTF-16 warning:

def is_binary(filename):
    """ 
    Return true if the given filename appears to be binary.
    File is considered to be binary if it contains a NULL byte.
    FIXME: This approach incorrectly reports UTF-16 as binary.
    """
    with open(filename, 'rb') as f:
        for block in f:
            if b'\0' in block:
                return True
    return False

Usually you have to guess.

You can look at the extensions as one clue, if the files have them.

You can also recognise known binary formats, and ignore those.

Otherwise, see what proportion of non-printable ASCII bytes you have and take a guess from that.

You can also try decoding from UTF-8 and see if that produces sensible output.
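The last two clues can be combined into one heuristic. The following is a sketch under stated assumptions: the function name `guess_is_binary`, the 4096-byte sample, and the 30% cutoff are illustrative choices, not established constants.

```python
# Printable ASCII plus common whitespace and control characters.
PRINTABLE = bytes(bytearray(range(0x20, 0x7f))) + b'\n\r\t\f\b'


def guess_is_binary(path, sample_size=4096, threshold=0.30):
    """Guess whether a file is binary from a sample of its first bytes."""
    with open(path, 'rb') as f:
        sample = f.read(sample_size)
    if not sample:
        return False  # treat an empty file as text
    if b'\x00' in sample:
        return True  # NUL bytes almost never occur in plain text
    try:
        sample.decode('utf-8')
        return False  # decodes cleanly, so assume text
    except UnicodeDecodeError:
        pass  # not valid UTF-8; fall back to the byte-ratio guess
    nontext = sample.translate(None, PRINTABLE)
    return float(len(nontext)) / len(sample) > threshold
```

The ratio fallback is what lets legacy single-byte encodings such as Latin-1 still count as text, at the cost of occasionally misjudging short or unusual files.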

Try using the currently maintained python-magic, which is not the same module as the one in @Kami Kisiel's answer. It supports all platforms, including Windows; however, you will need the libmagic binary files. This is explained in the README.

Unlike the mimetypes module, it doesn't use the file's extension and instead inspects the contents of the file.

>>> import magic
>>> magic.from_file("testdata/test.pdf", mime=True)
'application/pdf'
>>> magic.from_file("testdata/test.pdf")
'PDF document, version 1.2'
>>> magic.from_buffer(open("testdata/test.pdf", "rb").read(1024))
'PDF document, version 1.2'

If you're not on Windows, you can use Python Magic to determine the file type. Then you can check whether it is a text/ MIME type.

Most programs consider a file to be binary (that is, any file that is not "line-oriented") if it contains a NUL character.

Here is Perl's version of pp_fttext() (pp_sys.c) implemented in Python:

import sys
PY3 = sys.version_info[0] == 3

# A function that takes an integer in the 8-bit range and returns
# a single-character byte object in py3 / a single-character string
# in py2.
#
int2byte = (lambda x: bytes((x,))) if PY3 else chr

_text_characters = (
        b''.join(int2byte(i) for i in range(32, 127)) +
        b'\n\r\t\f\b')

def istextfile(fileobj, blocksize=512):
    """ Uses heuristics to guess whether the given file is text or binary,
        by reading a single block of bytes from the file.
        If more than 30% of the chars in the block are non-text, or there
        are NUL ('\x00') bytes in the block, assume this is a binary file.
    """
    block = fileobj.read(blocksize)
    if b'\x00' in block:
        # Files with null bytes are binary
        return False
    elif not block:
        # An empty file is considered a valid text file
        return True

    # Use translate's 'deletechars' argument to efficiently remove all
    # occurrences of _text_characters from the block
    nontext = block.translate(None, _text_characters)
    return float(len(nontext)) / len(block) <= 0.30

Note also that this code was written to run on both Python 2 and Python 3 without changes.

Source: Perl's "guess if file is text or binary" implemented in Python

Here's a function that first checks whether the file starts with a BOM and, if not, looks for a zero byte within the initial 8192 bytes:

import codecs


#: BOMs to indicate that a file is a text file even if it contains zero bytes.
_TEXT_BOMS = (
    codecs.BOM_UTF16_BE,
    codecs.BOM_UTF16_LE,
    codecs.BOM_UTF32_BE,
    codecs.BOM_UTF32_LE,
    codecs.BOM_UTF8,
)


def is_binary_file(source_path):
    with open(source_path, 'rb') as source_file:
        initial_bytes = source_file.read(8192)
    return not any(initial_bytes.startswith(bom) for bom in _TEXT_BOMS) \
           and b'\0' in initial_bytes

Technically, the check for the UTF-8 BOM is unnecessary, because UTF-8 text should not contain zero bytes for all practical purposes. But as it is a very common encoding, it's quicker to check for the BOM at the beginning instead of scanning all 8192 bytes for 0.

from binaryornot.check import is_binary
is_binary('filename')

Documentation

I guess that the best solution is to use the guess_type function. It holds a list of several MIME types, and you can also add your own. Here is the script I wrote to solve my problem:

from mimetypes import guess_type
from mimetypes import add_type
from os import listdir  # needed by __listDir below

def __init__(self):
        self.__addMimeTypes()

def __addMimeTypes(self):
        add_type("text/plain",".properties")

def __listDir(self,path):
        try:
            return listdir(path)
        except IOError:
            print ("The directory {0} could not be accessed".format(path))

def getTextFiles(self, path):
        asciiFiles = []
        for files in self.__listDir(path):
            mime = guess_type(files)[0]  # None when the extension is unknown
            if mime is not None and mime.split("/")[0] == "text":
                asciiFiles.append(files)
        try:
            return asciiFiles
        except NameError:
            print ("No text files in directory: {0}".format(path))
        finally:
            del asciiFiles

It is inside a class, as you can see from the structure of the code, but you can adapt it to fit your application. It's quite simple to use: the method getTextFiles returns a list of all the text files that reside in the directory you pass in the path variable.

On *NIX:

If you have access to the file shell command, shlex can help make the subprocess module more usable:

from os.path import realpath
from subprocess import check_output
from shlex import split

filepath = realpath('rel/or/abs/path/to/file')
assert b'ascii' in check_output(split('file {}'.format(filepath))).lower()

Or you could stick that in a for loop to get output for all files in the current directory:

import os
for afile in [x for x in os.listdir('.') if os.path.isfile(x)]:
    assert b'ascii' in check_output(split('file {}'.format(afile))).lower()

Or for all subdirectories:

for curdir, _subdirs, filelist in os.walk('.'):
     for afile in filelist:
         # join with curdir so files in subdirectories resolve correctly
         assert b'ascii' in check_output(split('file {}'.format(os.path.join(curdir, afile)))).lower()

I came here looking for exactly the same thing: a comprehensive solution provided by the standard library to detect binary or text. After reviewing the options people suggested, the *nix file command looks to be the best choice (I'm only developing for Linux boxen). Some others posted solutions using file, but they are unnecessarily complicated in my opinion, so here's what I came up with:

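import shlex
import subprocess

def test_file_isbinary(filename):
    cmd = shlex.split("file -b -e soft '{}'".format(filename))
    # check_output returns bytes on Python 3, so compare against bytes
    if subprocess.check_output(cmd)[:4] in {b'ASCI', b'UTF-'}:
        return False
    return True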

It should go without saying, but the code that calls this function should make sure the file is readable before testing it; otherwise the file will be mistakenly detected as binary.

A simpler way is to check whether the file contains a NUL character (\x00) by using the in operator, for instance:

b'\x00' in open("foo.bar", 'rb').read()

See the complete example below:

#!/usr/bin/env python3
import argparse
if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('file', nargs=1)
    args = parser.parse_args()
    with open(args.file[0], 'rb') as f:
        if b'\x00' in f.read():
            print('The file is binary!')
        else:
            print('The file is not binary!')

Sample usage:

$ ./is_binary.py /etc/hosts
The file is not binary!
$ ./is_binary.py `which which`
The file is binary!

All of these basic methods were incorporated into a Python library: binaryornot. Install it with pip.

From the documentation:

>>> from binaryornot.check import is_binary
>>> is_binary('README.rst')
False

Are you on Unix? If so, then try:

isBinary = os.system("file -b " + name + " | grep text > /dev/null")

The shell return values are inverted (0 is OK, so if it finds "text" it returns 0, which is a false value in Python).
