Python, file(1) - Why are the numbers [7,8,9,10,12,13,27] and range(0x20, 0x100) used for determining text vs binary file

Question

In a solution for determining whether a file is binary or text in Python, the answerer uses:

textchars = bytearray([7,8,9,10,12,13,27]) + bytearray(range(0x20, 0x100))

and then uses .translate(None, textchars) to remove (that is, replace with nothing) all such characters from a file that was read in as binary.
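For reference, the check described above can be sketched roughly as follows. The function names and the 1024-byte block size are assumptions made for illustration; only the textchars expression and the .translate(None, textchars) call come from the question.

```python
# Byte values treated as "text": seven control codes plus 0x20-0xFF.
textchars = bytearray([7, 8, 9, 10, 12, 13, 27]) + bytearray(range(0x20, 0x100))


def is_binary_string(data: bytes) -> bool:
    """Return True if any byte of `data` lies outside `textchars`."""
    # translate(None, textchars) deletes every byte listed in textchars;
    # whatever is left over is considered non-text.
    return bool(data.translate(None, textchars))


def is_binary_file(path: str, blocksize: int = 1024) -> bool:
    """Hypothetical helper: apply the check to the first `blocksize` bytes."""
    with open(path, "rb") as f:
        return is_binary_string(f.read(blocksize))


print(is_binary_string(b"hello, world\r\n"))   # False
print(is_binary_string(b"\x00\x01\x02hello"))  # True
```

Only a leading block of the file is inspected in this sketch, so a file whose non-text bytes appear later would still be reported as text.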

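The answerer also argues that this choice of numbers is "based on file(1) behaviour" (for what counts as text and what does not). What is so significant about these numbers for telling text files apart from binary ones?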

Answer 1

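They represent the most common codepoints for printable text, plus newlines, spaces, carriage returns and the like. ASCII is covered up to 0x7F, and standards like Latin-1 or Windows Codepage 1251 use the remaining 128 byte values for accented characters and so on.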
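To make the grouping concrete, here is a small sketch that labels a byte value according to where it falls in textchars. The describe function and the control_names table are illustrative names, not part of the original solution; the control-code names follow the ASCII standard.

```python
textchars = bytearray([7, 8, 9, 10, 12, 13, 27]) + bytearray(range(0x20, 0x100))

# The seven control codes that textchars lists explicitly.
control_names = {
    7: "BEL (bell)",
    8: "BS (backspace)",
    9: "HT (tab)",
    10: "LF (line feed)",
    12: "FF (form feed)",
    13: "CR (carriage return)",
    27: "ESC (escape)",
}


def describe(byte: int) -> str:
    """Say which part of textchars a byte value belongs to, if any."""
    if byte in control_names:
        return control_names[byte]
    if byte not in textchars:
        return "not in textchars"
    if byte < 0x80:
        return "ASCII range 0x20-0x7F"
    return "high byte 0x80-0xFF (meaning depends on the 8-bit encoding)"


for value in (0x09, 0x0B, 0x41, 0xE9):
    print(hex(value), describe(value))
```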

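You would expect text to use only those codepoints. Binary data would use all codepoints in the range 0x00-0xFF; a text file, for example, will probably not use \x00 (NUL) or \x1F (Unit Separator in the ASCII standard).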

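It is a heuristic at best, however. Some text files may still use C0 control codes outside the 7 characters explicitly named, and I am sure binary data exists that happens not to include any of the 25 byte values missing from the textchars string.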
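Both failure modes are easy to reproduce. The sketch below lists the 25 excluded byte values and then shows one misclassification in each direction; looks_binary and the sample byte strings are made up for illustration.

```python
textchars = bytearray([7, 8, 9, 10, 12, 13, 27]) + bytearray(range(0x20, 0x100))

# Every byte value that textchars does not contain.
excluded = [b for b in range(0x100) if b not in textchars]
print(len(excluded))               # 25
print([hex(b) for b in excluded])  # all of them lie in the C0 range 0x00-0x1F


def looks_binary(data: bytes) -> bool:
    return bool(data.translate(None, textchars))


# Text that uses a vertical tab (0x0B) is reported as binary.
print(looks_binary(b"column one\x0bcolumn two\n"))  # True

# Arbitrary non-text bytes that avoid the excluded values pass as text.
payload = bytes([0x89, 0xC3, 0xFF, 0xD8, 0xE0, 0x90])
print(looks_binary(payload))  # False
```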

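The author of that range probably based it on the text_chars table from the file command. The table marks each byte as non-text, ASCII, Latin-1 or non-ISO extended ASCII, and includes documentation on why those codepoints were chosen: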

/*
 * This table reflects a particular philosophy about what constitutes
 * "text," and there is room for disagreement about it.
 *
 * [....]
 *
 * The table below considers a file to be ASCII if all of its characters
 * are either ASCII printing characters (again, according to the X3.4
 * standard, not isascii()) or any of the following controls: bell,
 * backspace, tab, line feed, form feed, carriage return, esc, nextline.
 *
 * I include bell because some programs (particularly shell scripts)
 * use it literally, even though it is rare in normal text.  I exclude
 * vertical tab because it never seems to be used in real text.  I also
 * include, with hesitation, the X3.64/ECMA-43 control nextline (0x85),
 * because that's what the dd EBCDIC->ASCII table maps the EBCDIC newline
 * character to.  It might be more appropriate to include it in the 8859
 * set instead of the ASCII set, but it's got to be included in *something*
 * we recognize or EBCDIC files aren't going to be considered textual.
 *
 * [.....]
 */

Interestingly enough, that table excludes 0x7F, which the code you found does not.
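If matching the table on this one point matters, a variant that also drops DEL (0x7F) could look like the sketch below. The names strict_textchars and has_nontext are hypothetical, and this only aligns the treatment of 0x7F; it does not reproduce the table's separate categories for bytes above 0x7F.

```python
textchars = bytearray([7, 8, 9, 10, 12, 13, 27]) + bytearray(range(0x20, 0x100))

# Hypothetical variant: the same set minus DEL (0x7F).
strict_textchars = bytearray(b for b in textchars if b != 0x7F)


def has_nontext(data: bytes, allowed: bytearray) -> bool:
    """Return True if `data` contains a byte that is not in `allowed`."""
    return bool(data.translate(None, allowed))


sample = b"abc\x7fdef"
print(has_nontext(sample, textchars))         # False: 0x7F counts as text
print(has_nontext(sample, strict_textchars))  # True: 0x7F is now non-text
```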

Solution 1
6 Accepted 2015-08-24 14:29:06
