Removing newline characters in Python when using urllib

Question

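I am using Python 3.x. When I download a web page with urllib.request, I get a lot of \n sequences scattered through the result. I have tried to remove them using the methods suggested in other threads on this forum, but without success. I have tried both strip() and replace(), with no luck. I am running the code in Eclipse. Here is my code: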

import urllib.request

#Downloading entire Web Document 
def download_page(a):
    opener = urllib.request.FancyURLopener({})
    try:
        open_url = opener.open(a)
        page = str(open_url.read())
        return page
    except:
        return""  
raw_html = download_page("http://www.zseries.in")
print("Raw HTML = " + raw_html)

#Remove line breaks
raw_html2 = raw_html.replace('\n', '')
print("Raw HTML2 = " + raw_html2)

I cannot work out why there are so many \n sequences in the raw_html variable.

Answer 1

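Your download_page() function corrupts the HTML (through the str() call), which is why you see \n (two characters, \ and n) in the output. Don't use .replace() or similar workarounds; fix the download_page() function instead: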

from urllib.request import urlopen

with urlopen("http://www.zseries.in") as response:
    html_content = response.read()

At this point html_content holds a bytes object. To get it as text, you need to know its character encoding, which you can take, for example, from the Content-Type HTTP header:

encoding = response.headers.get_content_charset('utf-8')
html_text = html_content.decode(encoding)

See "A good way to get the charset/encoding of an HTTP response in Python".
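Putting the two snippets above together, a corrected download_page() might look roughly like the sketch below. The helper name decode_body, the UTF-8 fallback, and errors="replace" are my own choices for illustration, not something the question requires.

```python
from email.message import Message
from urllib.request import urlopen


def decode_body(body: bytes, headers: Message, default: str = "utf-8") -> str:
    """Decode a response body using the charset from its Content-Type header.

    `default` is used when the header carries no charset (an assumption;
    the real encoding may differ).
    """
    encoding = headers.get_content_charset(default)
    # errors="replace" keeps going on undecodable bytes instead of raising.
    return body.decode(encoding, errors="replace")


def download_page(url: str) -> str:
    """Download a page and return it as text, not as the repr of bytes."""
    with urlopen(url) as response:
        # response.headers is an email.message.Message subclass.
        return decode_body(response.read(), response.headers)
```

Calling download_page("http://www.zseries.in") would then return a str in which line breaks are real newline characters, so raw_html.replace('\n', '') behaves as expected.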

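If the server does not send a charset in the Content-Type header, there are complex rules for working out the character encoding of an HTML5 document. For example, it may be declared inside the document itself: <meta charset="utf-8"> (you would need an HTML parser to extract it).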
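As a rough illustration of the <meta charset> case only, here is a sketch built on the standard library's html.parser. The names MetaCharsetFinder and sniff_meta_charset are hypothetical, and this is not the full HTML5 encoding-sniffing algorithm: it ignores byte-order marks and the http-equiv form of the tag. The 1024-byte limit reflects where HTML5 expects the declaration to appear.

```python
from html.parser import HTMLParser
from typing import Optional


class MetaCharsetFinder(HTMLParser):
    """Remember the first <meta charset="..."> value seen."""

    def __init__(self) -> None:
        super().__init__()
        self.charset: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        # The parser lowercases tag and attribute names for us.
        if tag == "meta" and self.charset is None:
            value = dict(attrs).get("charset")
            if value:
                self.charset = value.strip().lower()


def sniff_meta_charset(body: bytes, limit: int = 1024) -> Optional[str]:
    """Return the declared charset from the first `limit` bytes, or None."""
    finder = MetaCharsetFinder()
    # The declaration itself is ASCII, so a lossy ASCII decode is enough here.
    finder.feed(body[:limit].decode("ascii", errors="replace"))
    return finder.charset
```

You could try this only when the Content-Type header has no charset, and fall back to a default such as UTF-8 when it returns None.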

If you read the HTML correctly, you will not see the literal characters \n in the page.
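To see the difference without touching the network, here is a small demonstration. The bytes literal is a made-up stand-in for what response.read() returns.

```python
# Made-up stand-in for the bytes returned by response.read().
raw = b"<p>caf\xc3\xa9</p>\n<p>ok</p>"

# str() on bytes gives the repr: a b'...' wrapper with escapes spelled out,
# so the newline becomes the two characters backslash and n.
broken = str(raw)

# decode() gives real text with a real newline character.
text = raw.decode("utf-8")

print(broken)
print(broken.replace("\n", "") == broken)  # nothing to replace in `broken`
print(text.replace("\n", ""))
```

Note that the str() version also mangles non-ASCII characters (the é shows up as \xc3\xa9) and keeps the b'...' wrapper, neither of which a replace() on \n would fix.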

Answer 2

If you look at the source you have downloaded, the \n escape sequences you are trying to replace() are themselves escaped: \\n. Try this instead:

import urllib.request

def download_page(a):
    opener = urllib.request.FancyURLopener({})
    open_url = opener.open(a)
    page = str(open_url.read()).replace('\\n', '')
    return page

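I removed the try/except clause because a generic except statement that does not target a specific exception (or class of exceptions) is generally bad practice. If the call fails, you will have no idea why.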
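If you do want error handling, one option is to catch only the exceptions you expect and report them. The sketch below is an assumption about what that could look like: it uses urlopen rather than FancyURLopener, returns the raw bytes, and the 10-second timeout and None return value are arbitrary choices.

```python
from typing import Optional
from urllib.error import URLError
from urllib.request import urlopen


def download_page(url: str) -> Optional[bytes]:
    """Return the raw response body, or None if the download fails."""
    try:
        with urlopen(url, timeout=10) as response:
            return response.read()
    except (URLError, ValueError) as exc:
        # URLError covers network and HTTP errors (HTTPError subclasses it);
        # ValueError is raised for strings that are not URLs at all.
        print(f"Could not download {url!r}: {exc}")
        return None
```

This way a failure still leaves a message saying what went wrong, and anything unexpected propagates instead of being silently swallowed.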

Answer 3

It looks like they are literal \n characters, so I suggest you do this:

raw_html2 = raw_html.replace('\\n', '')

Answer 1: 7, 2014-12-28 06:37:09
Answer 2: 1, 2014-12-28 06:18:12
Answer 3: 1, accepted, 2014-12-28 06:18:12
