簡體   English   中英

Python,file(1) - 為什么數字[7,8,9,10,12,13,27]和范圍(0x20,0x100)用於確定文本vs二進制文件

[英]Python, file(1) - Why are the numbers [7,8,9,10,12,13,27] and range(0x20, 0x100) used for determining text vs binary file

關於在python中確定文件是二進制文件還是文本解決方案 ,應答者使用:

textchars = bytearray([7,8,9,10,12,13,27]) + bytearray(range(0x20, 0x100))

然后使用.translate(None, textchars)刪除(或替換為.translate(None, textchars)以二進制形式讀取的文件中的所有此類字符。

回答者還爭辯說,這種數字的選擇是“基於文件(1)行為”(對於什么是文本而不是什么)。 這些數字的重要性是從二進制文件中確定文本文件?

它們代表可打印文本的最常見代碼點,以及換行符,空格和回車符等。 ASCII被覆蓋到0x7F,像Latin-1或Windows Codepage 1251這樣的標准使用剩余的128個字節來表示重音字符等。

您希望文本使用這些代碼點。 二進制數據將使用0x00-0xFF范圍內的所有代碼點; 例如,文本文件可能不會使用\\ x00(NUL)或\\ x1F(ASCII標准中的單位分隔符)。

不過,它充其量只是一種啟發式方法。 某些文本文件仍然可以嘗試使用明確命名的7個字符之外的C0控制代碼 ,並且我確定存在的二進制數據恰好不包括textchars字符串中未包含的25個字節值。

范圍的作者可能基於file命令中的text_chars 它將字節標記為非文本,ASCII,Latin-1或非ISO擴展ASCII,並包含有關為何選擇這些代碼點的文檔:

/*
 * This table reflects a particular philosophy about what constitutes
 * "text," and there is room for disagreement about it.
 *
 * [....]
 *
 * The table below considers a file to be ASCII if all of its characters
 * are either ASCII printing characters (again, according to the X3.4
 * standard, not isascii()) or any of the following controls: bell,
 * backspace, tab, line feed, form feed, carriage return, esc, nextline.
 *
 * I include bell because some programs (particularly shell scripts)
 * use it literally, even though it is rare in normal text.  I exclude
 * vertical tab because it never seems to be used in real text.  I also
 * include, with hesitation, the X3.64/ECMA-43 control nextline (0x85),
 * because that's what the dd EBCDIC->ASCII table maps the EBCDIC newline
 * character to.  It might be more appropriate to include it in the 8859
 * set instead of the ASCII set, but it's got to be included in *something*
 * we recognize or EBCDIC files aren't going to be considered textual.
 *
 * [.....]
 */

有趣的是,該表排除了 0x7F,你發現的代碼沒有。

暫無
暫無

聲明:本站的技術帖子網頁,遵循CC BY-SA 4.0協議,如果您需要轉載,請注明本站網址或者原文地址。任何問題請咨詢:yoyou2525@163.com.

 
粵ICP備18138465號  © 2020-2024 STACKOOM.COM