Reading/writing files with umlauts in Python (HTML to txt)

Question

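I know this question has been asked a few times, but I think I am doing everything right and it still doesn't work, so before I go insane I am posting it. This is the code (it should convert an HTML file to a txt file and leave out certain lines):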

fid = codecs.open(htmlFile, "r", encoding = "utf-8")
if not fid:
    return
htmlText = fid.read()
fid.close()

stripped = strip_tags(unicode(htmlText))   ### strip html tags (this is not the prob)
lines = stripped.split('\n')
out = []

for line in lines: # just some stuff i want to leave out of the output
    if len(line) < 6:
        continue
    if '*' in line or '(' in line or '@' in line or ':' in line:
        continue
    out.append(line)

result=  '\n'.join(out)
base, ext = os.path.splitext(htmlFile)
outfile = base + '.txt'

fid = codecs.open(outfile, "w", encoding = 'utf-8')
fid.write(result)
fid.close()

Thanks!

Answer 1

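I'm not sure, but by doing

'\n'.join(out)

with a non-unicode string (a plain old byte string), you might be falling back to some non-UTF-8 codec. Try:

u'\n'.join(out)

to make sure you are using unicode objects everywhere.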
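A quick way to check the suggestion is sketched here. It assumes Python 2.6+ or Python 3.3+ (both accept the u prefix), and the sample strings are made up for illustration rather than taken from the asker's file.

```python
# Sample lines as unicode objects; the u prefix is also accepted by Python 3.3+.
out = [u"This is \xe4 test", u"Here is \xe4nother line."]

# A unicode separator keeps the whole expression in unicode.
result = u"\n".join(out)

# type(u"") is `unicode` on Python 2 and `str` on Python 3.
assert isinstance(result, type(u""))
```

If the assertion holds for the real data, the joined string is not where the problem comes from.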

Answer 2

You haven't specified the problem, so this is a complete guess.

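What does the strip_tags() function return? Does it return a unicode object or a byte string? If it is the latter, it will probably cause decoding problems when you try to write it to the file. For example, if strip_tags() returns a utf-8 encoded byte string: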

>>> s = u'This is \xe4 test\nHere is \xe4nother line.'
>>> print s
This is ä test
Here is änother line.

>>> s_utf8 = s.encode('utf-8')
>>> f=codecs.open('test', 'w', encoding='utf8')
>>> f.write(s)    # no problem with this: s is unicode
>>> f.write(s_utf8)    # but this fails: s_utf8 is a byte string
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib64/python2.7/codecs.py", line 691, in write
    return self.writer.write(data)
  File "/usr/lib64/python2.7/codecs.py", line 351, in write
    data, consumed = self.encode(object, self.errors)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 8: ordinal not in range(128)

If this is what you are seeing, you need to make sure that you pass unicode to fid.write(result), which probably means making sure that strip_tags() returns unicode.
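One way to guarantee that is a small guard in front of the write. The helper name to_unicode and the default utf-8 encoding are assumptions made for this sketch, not part of the asker's code.

```python
def to_unicode(value, encoding="utf-8"):
    """Return value as a unicode object, decoding it first if it is a byte string."""
    # On Python 2, `bytes` is an alias for `str`, so this catches plain byte strings.
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


# A utf-8 encoded byte string is decoded; a unicode object passes through unchanged.
decoded = to_unicode(b"This is \xc3\xa4 test")
unchanged = to_unicode(u"This is \xe4 test")
```

In the question's code this might be applied as to_unicode(strip_tags(htmlText)), assuming strip_tags() produces utf-8 when it returns bytes.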

Also, I noticed a couple of other things:

codecs.open() raises an IOError exception if the file cannot be opened. It does not return None, so the if not fid: test does nothing useful. You need to use try/except, ideally together with a with statement.

import codecs

try:
    with codecs.open(htmlFile, "r", encoding = "utf-8") as fid:
        htmlText = fid.read()
except IOError as e:
    # handle error
    print(e)

Also, the data that you read from a file opened via codecs.open() is automatically converted to unicode, so calling unicode(htmlText) achieves nothing.
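Putting these points together, the whole conversion could look roughly like the sketch below. It uses io.open, which behaves like codecs.open for this purpose on Python 2.6+ and is the built-in open on Python 3. The function name html_to_txt is hypothetical, and strip_tags() here is only a crude regex stand-in, since the asker's real implementation is not shown.

```python
import io
import os
import re


def strip_tags(text):
    # Hypothetical stand-in for the asker's strip_tags(); a crude regex,
    # not a real HTML parser. It takes unicode and returns unicode.
    return re.sub(r"<[^>]+>", "", text)


def html_to_txt(html_file):
    # io.open decodes while reading, so html_text is already unicode.
    with io.open(html_file, "r", encoding="utf-8") as fid:
        html_text = fid.read()

    out = []
    for line in strip_tags(html_text).split(u"\n"):
        # Same filters as in the question: skip short lines and lines
        # containing any of the characters * ( @ :
        if len(line) < 6:
            continue
        if any(ch in line for ch in u"*(@:"):
            continue
        out.append(line)

    base, _ext = os.path.splitext(html_file)
    outfile = base + ".txt"
    # The file object encodes to utf-8 on write, so it must be given unicode.
    with io.open(outfile, "w", encoding="utf-8") as fid:
        fid.write(u"\n".join(out))
    return outfile
```

An IOError raised by either io.open call propagates to the caller, who can wrap html_to_txt() in try/except as described above.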

Answer 1: 2012-07-19 23:23:06
Answer 2: 2012-07-20 02:37:32
