Python HTML to text file: UnicodeDecodeError?

Question

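I'm writing a program that reads a web page with urllib and then uses "html2text" to write the basic text to a file. However, the raw content returned by urllib's read() contains all sorts of characters, so it keeps raising UnicodeDecodeError.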

Of course I Googled for 3 hours and found plenty of answers: using HTMLParser or reload(sys), using external modules such as pdfkit or BeautifulSoup, and of course .encode/.decode.

Reloading sys and then calling sys.setdefaultencoding("utf-8") gives me the result I want, but IDLE and the program become unresponsive afterwards.

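I tried every variation of .encode/.decode with 'utf-8' and 'ascii', along with error-handling arguments such as 'replace' and 'ignore'. For some reason, the same error is raised every time, no matter which arguments I pass to encode/decode.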

def download(self, url, name="WebPage.txt"):
    ## Saves only the text to file
    page = urllib.urlopen(url)
    content = page.read()
    with open(name, 'wb') as w:
        HP_inst = HTMLParser.HTMLParser()
        content = content.encode('ascii', 'xmlcharrefreplace')
        if True: 
            #w.write(HTT.html2text( (HP_inst.unescape( content ) ).encode('utf-8') ) )
            w.write( HTT.html2text( content) )#.decode('ascii', 'ignore')  ))
            w.close()
            print "Saved!"

I must be missing some other method or encoding... please help!

Side quest: sometimes I have to write to a file whose name contains unsupported characters, such as "G\u00e9za Teleki" + ".txt". How can I filter those characters out?

Notes:

This function is stored in a class (hence the "self").
Using Python 2.7
Don't want to use BeautifulSoup
Windows 8 64-bit

Answer 1

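You should decode the content you get from urllib using the correct encoding, for example utf-8 or latin1, depending on the page you fetched.

There are several ways to detect the content's encoding: from the HTTP headers, or from the meta tag in the HTML. I'd suggest using an encoding detection module; I forgot its name, but you can Google it.

Once the content is decoded correctly, you can encode it to whatever encoding you want before writing it to the file.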
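As a rough sketch of the header/meta approach mentioned above: look for a charset in the Content-Type header first, then in a meta tag near the top of the page, and fall back to UTF-8. The helper names guess_charset and decode_page are made up for illustration, and the regular expressions are a simplification rather than a real HTML parser, so treat this as a starting point only.

```python
# -*- coding: utf-8 -*-
import re

# Hypothetical helpers for illustration; not part of urllib or html2text.
_HEADER_CHARSET = re.compile(r'charset\s*=\s*["\']?([A-Za-z0-9_\-]+)',
                             re.IGNORECASE)
_META_CHARSET = re.compile(br'<meta[^>]+charset\s*=\s*["\']?([A-Za-z0-9_\-]+)',
                           re.IGNORECASE)


def guess_charset(content_type, body, default='utf-8'):
    """Return a charset from the Content-Type header, else a <meta> tag, else default."""
    match = _HEADER_CHARSET.search(content_type or '')
    if match:
        return match.group(1).lower()
    # Only scan the first couple of kilobytes, where <meta> normally sits.
    match = _META_CHARSET.search(body[:2048])
    if match:
        return match.group(1).decode('ascii').lower()
    return default


def decode_page(content_type, body):
    """Decode raw bytes to unicode text, replacing bytes that do not fit."""
    charset = guess_charset(content_type, body)
    try:
        return body.decode(charset, 'replace')
    except LookupError:
        # The page advertised a charset Python does not know: fall back to UTF-8.
        return body.decode('utf-8', 'replace')
```

With Python 2.7's urllib, the header value should be available as page.info().getheader('Content-Type') and the body as page.read(); the decoded result can then be encoded to UTF-8 before writing it to the file.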

=====================================

Here is an example using chardet:

import urllib
import chardet


def main():
    page = urllib.urlopen('http://bbc.com')
    content = page.read()

    # detect the encoding; chardet reports None when it cannot guess,
    # so fall back to utf-8 in that case
    encoding = chardet.detect(content)['encoding'] or 'utf-8'

    # decode the content into unicode, replacing bytes that do not fit
    content = content.decode(encoding, 'replace')

    # write to file
    with open('test.txt', 'wb') as f:
        f.write(content.encode('utf-8'))

Answer 2

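You have to know the encoding the remote web page uses. There are many ways to find that out, but the easiest is to use the Python Requests library instead of urllib. Requests returns pre-decoded Unicode objects.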

You can then use an encoding file wrapper to automatically encode every character you write.

import requests
import io

def download(self, url, name="WebPage.txt"):
    ## Saves only the text to file
    req = requests.get(url)
    content = req.text # Returns a Unicode object decoded using the server's header
    with io.open(name, 'w', encoding="utf-8") as w: # Everything written to w is encoded to UTF-8
        w.write( HTT.html2text( content) )

    print "Saved"
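
For the side quest in the question (file names containing unsupported characters), one possible approach is to replace the characters Windows forbids in file names and, optionally, reduce accented letters to plain ASCII. This is only a sketch: safe_filename is a hypothetical helper, it assumes the name is already a unicode string (not a byte string), and it does not handle reserved device names such as CON or NUL.

```python
# -*- coding: utf-8 -*-
import re
import unicodedata

# Characters Windows does not allow in file names, plus ASCII control characters.
_FORBIDDEN = re.compile(u'[<>:"/\\\\|?*\x00-\x1f]')


def safe_filename(name, ascii_only=False, replacement=u'_'):
    """Hypothetical helper: return a version of `name` usable as a file name."""
    cleaned = _FORBIDDEN.sub(replacement, name)
    if ascii_only:
        # Split accented letters into base letter + combining mark,
        # then drop everything that is not ASCII.
        decomposed = unicodedata.normalize('NFKD', cleaned)
        cleaned = decomposed.encode('ascii', 'ignore').decode('ascii')
    # Windows also rejects names that end in a dot or a space.
    cleaned = cleaned.rstrip(u'. ')
    return cleaned or replacement
```

For example, safe_filename(u"G\u00e9za Teleki" + u".txt", ascii_only=True) drops the accent and keeps the rest of the name. Leaving ascii_only off keeps the accented letter, which NTFS can store as long as the name is passed to open() as unicode.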
