Python, file(1) - Why are the numbers [7,8,9,10,12,13,27] and range(0x20, 0x100) used for determining text vs binary file?
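In a solution for determining whether a file is binary or text in Python, the answerer uses:
textchars = bytearray([7,8,9,10,12,13,27]) + bytearray(range(0x20, 0x100))
and then uses .translate(None, textchars) to remove (that is, replace with nothing) all such characters from a file read in as binary.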
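The answerer also argues that this choice of numbers is "based on file(1) behaviour" (for what counts as text and what does not). What is so significant about these numbers that they distinguish text files from binary ones?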
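They represent the most common code points for printable text, plus newlines, spaces, carriage returns and the like. ASCII covers everything up to 0x7F, and standards like Latin-1 or Windows Codepage 1251 use the remaining 128 byte values for accented characters and the like.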
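You would expect text to use only those code points. Binary data can use any byte value in the range 0x00-0xFF; a text file, by contrast, will probably not use \x00 (NUL) or \x1F (Unit Separator in the ASCII standard).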
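The check described above can be sketched roughly as follows for Python 3. The names is_binary_string, is_binary_file and blocksize are illustrative assumptions, not taken from file(1) or any library; only the textchars definition and the .translate(None, textchars) call come from the question.

```python
# Byte values the heuristic treats as "text": seven control codes
# (BEL, BS, TAB, LF, FF, CR, ESC) plus everything from 0x20 to 0xFF.
textchars = bytearray([7, 8, 9, 10, 12, 13, 27]) + bytearray(range(0x20, 0x100))


def is_binary_string(data: bytes) -> bool:
    """Return True if data contains any byte outside textchars (hypothetical helper)."""
    # bytes.translate(None, delete) removes every byte listed in `delete`.
    # Whatever survives is a byte the heuristic regards as non-text.
    return bool(data.translate(None, textchars))


def is_binary_file(path: str, blocksize: int = 1024) -> bool:
    """Apply the heuristic to the first `blocksize` bytes of a file (hypothetical helper)."""
    with open(path, "rb") as f:
        return is_binary_string(f.read(blocksize))
```

Only a leading block is inspected here, which is a design choice of this sketch: it keeps the check cheap, at the cost of missing non-text bytes that appear later in the file.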
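It is a heuristic at best, however. Some text files may still use C0 control codes other than the 7 explicitly named, and I am sure binary data exists that happens not to contain any of the 25 byte values missing from the textchars string.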
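To make the limits of the heuristic concrete, here is a small sketch that lists the excluded byte values and shows one misclassification in each direction. The helper name looks_binary and the sample inputs are assumptions chosen for illustration.

```python
textchars = bytearray([7, 8, 9, 10, 12, 13, 27]) + bytearray(range(0x20, 0x100))

# Every byte value that is NOT in textchars: 0-6, 11, 14-26 and 28-31.
excluded = sorted(set(range(0x100)) - set(textchars))
print(len(excluded))  # 25


def looks_binary(data: bytes) -> bool:
    """Hypothetical helper: True if any byte falls outside textchars."""
    return bool(data.translate(None, textchars))


# Text flagged as binary: a vertical tab (0x0B) is one of the excluded controls,
# and UTF-16 text contains NUL bytes for every ASCII character.
print(looks_binary(b"line one\x0bline two"))      # True
print(looks_binary("hi".encode("utf-16-le")))     # True

# Arbitrary payload passed as text: it simply avoids the 25 excluded values.
print(looks_binary(bytes(range(0x80, 0x100))))    # False
```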
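The author of that range probably based it on the text_chars table from the file command. That table marks each byte as non-text, ASCII, Latin-1 or non-ISO extended ASCII, and includes documentation on why those code points were chosen: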
/*
* This table reflects a particular philosophy about what constitutes
* "text," and there is room for disagreement about it.
*
* [....]
*
* The table below considers a file to be ASCII if all of its characters
* are either ASCII printing characters (again, according to the X3.4
* standard, not isascii()) or any of the following controls: bell,
* backspace, tab, line feed, form feed, carriage return, esc, nextline.
*
* I include bell because some programs (particularly shell scripts)
* use it literally, even though it is rare in normal text. I exclude
* vertical tab because it never seems to be used in real text. I also
* include, with hesitation, the X3.64/ECMA-43 control nextline (0x85),
* because that's what the dd EBCDIC->ASCII table maps the EBCDIC newline
* character to. It might be more appropriate to include it in the 8859
* set instead of the ASCII set, but it's got to be included in *something*
* we recognize or EBCDIC files aren't going to be considered textual.
*
* [.....]
*/
Interestingly enough, that table excludes 0x7F, which the code you found does not.
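The difference is easy to check, and to remove if you want the Python heuristic to follow the table more closely on this point. The sketch below assumes that dropping 0x7F (DEL) is the only adjustment wanted; strict_textchars is a hypothetical name, and this is not a full port of the text_chars table.

```python
textchars = bytearray([7, 8, 9, 10, 12, 13, 27]) + bytearray(range(0x20, 0x100))

# range(0x20, 0x100) includes 0x7F (DEL), so the original heuristic accepts it.
print(0x7F in textchars)  # True

# Hypothetical stricter variant: the same set with DEL removed.
strict_textchars = bytearray(b for b in textchars if b != 0x7F)

sample = b"abc\x7fdef"
print(bool(sample.translate(None, textchars)))         # False: treated as text
print(bool(sample.translate(None, strict_textchars)))  # True: DEL now counts as non-text
```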