Python写入文本文件跳过坏行

Question

已解决：问题与 Python 版本有关，请参阅 stackoverflow.com/a/5513856/2540382

我正在摆弄htm -> txt文件转换，但遇到了一些麻烦。 我的项目基本上是在转换messages.htm我下载我的Facebook聊天记录的文件转换成messages.txt所有的文件<>括号去掉和格式保存。

文件messages.htm被解析为变量text 。

然后我运行：

target = open('output.txt', 'w')
target.write(text)
target.close

这似乎有效，除非我遇到无效字符。 如以下错误所示。 有没有办法：

写入时跳过包含无效字符的行？
找出无效字符的位置并删除相应的字符或行？

期望的结果是尽可能避免将奇怪的字符放在一起。

return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\U000fe333' in position 37524: character
maps to <undefined>

Answer 1

target = open('output.txt', 'wb')
target.write(text.encode('ascii', 'ignore'))
target.close()

对于 .encode(..) 的 "errors" 参数，'ignore' 将删除这些字符，而 'replace' 将用 '?' 替换它们。

为了测试这一点，我用

target.write(u"foo\U000fe333bar".encode("ascii", "ignore"))

并确认 output.txt 只包含“foobar”。

更新：我将open(.., 'w')为open(.., 'wb')以确保这也适用于 Python 3。

Python写入文本文件跳过坏行

问题描述

1 个解决方案

解决方案1
3 已采纳 2015-10-31 05:27:31

Python写入文本文件跳过坏行

问题描述

1 个解决方案

解决方案1 3 已采纳 2015-10-31 05:27:31

解决方案1
3 已采纳 2015-10-31 05:27:31