I have this non-text file that has a bunch of bytes and some text, how do I go about separating the text cleanly from the rest?

The file is relatively long (around 3 MB), so this can't be done manually, and the text in it probably amounts to more than a thousand lines scattered throughout (there are line breaks too, so the text is properly formatted). Nothing in the format indicates where a byte section ends and a text section starts (the text is stored as bytes too, this isn't a .txt file); a chunk of text is simply surrounded by bytes, and then there is another chunk of text. Deleting all non-ASCII characters in Notepad++ does remove a good portion of the binary data, but a whole bunch of other stuff is still left behind.

Preferred language is Python.

Open the file with the encoding that seems to match the contents (probably utf8) and just ignore all decoding errors:

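# errors="ignore" silently drops any bytes that are not valid UTF-8
with open("my_file", encoding="utf8", errors="ignore") as f:
    for i, line in enumerate(f, 1):
        # do something with line; printing it is just a placeholder
        print(i, line, end="")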

See "UnicodeDecodeError in Python when reading a file, how to ignore the error and jump to the next line?" for more information.
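Note that errors="ignore" only drops bytes that cannot be decoded; control characters and stray bytes that happen to be valid ASCII still come through, which matches what the question describes after deleting non-ASCII characters. If that leftover noise is a problem, one alternative (not part of the answer above, and similar in spirit to the Unix strings utility) is to read the file as raw bytes and keep only runs of printable characters. The following is a minimal sketch under the assumption that the text is mostly printable ASCII; the helper name extract_text_runs and the min_len threshold of 4 are hypothetical choices for illustration.

```python
import re
from typing import List


def extract_text_runs(data: bytes, min_len: int = 4) -> List[str]:
    """Return runs of printable ASCII (plus tab, CR, LF) of at least min_len bytes.

    Hypothetical helper for illustration; assumes the embedded text is ASCII.
    """
    # Printable ASCII is 0x20-0x7e; tab and line breaks are kept so that
    # multi-line chunks of text stay together as one run.
    pattern = re.compile(rb"[\x20-\x7e\t\r\n]{%d,}" % min_len)
    return [m.group().decode("ascii") for m in pattern.finditer(data)]


# Small in-memory demonstration: two text chunks surrounded by binary bytes.
sample = b"\x00\x01Hello, world\n\xff\xfeab\x00Second line\x03"
runs = extract_text_runs(sample)
print(runs)
```

For the real file, the bytes could be obtained with open("my_file", "rb") and f.read(); at around 3 MB the whole file fits comfortably in memory. Raising min_len filters out more accidental short matches such as "ab" in the sample, at the cost of dropping very short genuine strings. Text containing non-ASCII characters would need a different character class.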
