Strange characters when writing an HTML file with open(file.html, 'a', encoding='utf-8') in Python

Question

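I am using Jinja and BS4 to scrape HTML and paste it into a new file, and everything was working fine.

I modified the script for another client and it threw an encoding error. I added encoding='utf-8' and it worked again.

However, some of the text is now gobbledygook. Honestly, I don't know what it is, but it isn't ASCII.

Example characters: Â¢ (that is not a long dash).

In other versions of this script it works without the encoding argument and produces zero strange characters.

The full script is here: git

The offending item is line 133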

Open the HTML:

f = open(new_file + '.html', 'a', encoding='utf-8')
message = result
f.write(message)  # write result
f.close()  # close html

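*Note that I have not yet pushed the new version with the encoding argument.

I fetch the page from a URL with requests and then read it with BeautifulSoup.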

r = requests.get(ebay_url)
html_bytes = r.content

html_string = html_bytes.decode('UTF-8')

soup = bs4(html_string, 'html.parser')
description_source = soup.find("div", {"class":"dv3"})

Answer 1

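Use .text (not .content!) to get the decoded response content from the requests module. That way you don't have to decode the response manually. The requests module automatically picks the encoding by looking at the HTTP response headers.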

import codecs
import requests
from bs4 import BeautifulSoup as bs4

def get_ebay_item_html(item_id):
    ebay_url = 'http://vi.vipr.ebaydesc.com/ws/eBayISAPI.dll?ViewItemDescV4&item='
    r = requests.get(ebay_url + str(item_id))

    return r.text
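
To see why the choice of decoder matters, here is a minimal sketch using only the standard library. The helper names `to_mojibake` and `repair_mojibake` are made up for illustration, and cp1252 is only an assumed culprit, not something the question confirms. It shows how the UTF-8 bytes of a long dash become three junk characters when decoded with the wrong charset, and how the damage can be reversed as long as no character was lost along the way.

```python
# Sketch: how UTF-8 text turns into mojibake when decoded as cp1252.
# Helper names are hypothetical; cp1252 is an assumed wrong charset.

def to_mojibake(text, wrong="cp1252"):
    """Encode correctly as UTF-8, then decode with the wrong charset."""
    return text.encode("utf-8").decode(wrong)


def repair_mojibake(text, wrong="cp1252"):
    """Undo to_mojibake; only works if every character survived."""
    return text.encode(wrong).decode("utf-8")


em_dash = "\u2014"
broken = to_mojibake(em_dash)
print(ascii(broken))                       # three characters instead of one
print(repair_mojibake(broken) == em_dash)  # the round trip restores the dash
```

A repair like this is a last resort; decoding the response correctly in the first place, as .text tries to do, avoids the problem.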

Now we can retrieve an item:

item_to_revise = "271796888350"
item_html = get_ebay_item_html(item_to_revise)

...scrape data from it:

soup = bs4(item_html, 'html.parser')
dv3 = soup.find("div", {"class": "dv3"})
print(dv3)

...save it to a file:

with codecs.open("271796888350.html", "w", encoding="UTF-8") as f:
    f.write(item_html)

...load it from a file:

with codecs.open("271796888350.html", "r", encoding="UTF-8") as f:
    item_html = f.read()

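...or send it to the ebaysdk module. For that, I strongly advise against a construct like this: "<![CDATA[" + f.read() + "]]>". CDATA cannot be built reliably that way. Use a proper XML-escaping function instead; it is much safer.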

from xml.sax.saxutils import escape
from ebaysdk.trading import Connection as Trading

api = Trading(debug=args.debug, siteid=site_id, appid=app_id, token=token_id, config_file=None, certid=cert_id, devid=dev_id)

api.execute('ReviseFixedPriceItem', {
    "Item": {
        "Country": "GB",
        "Description": escape(item_html),
        "ItemID": item_to_revise
    }
})
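
The CDATA warning can be checked with a small sketch. Everything here is illustrative: the `Description` element name is borrowed from the dictionary key above, the helper names and the sample text are made up, and the standard-library parser stands in for whatever eBay uses on its side. Hand-built CDATA stops being well-formed as soon as the text contains `]]>`, while `escape` copes with any text.

```python
# Sketch: why wrapping text in CDATA by hand is fragile.
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def wrap_cdata(text):
    """Naive approach: breaks as soon as text contains ']]>'."""
    return "<Description><![CDATA[" + text + "]]></Description>"


def wrap_escaped(text):
    """Escape &, < and > so any text is safe as element content."""
    return "<Description>" + escape(text) + "</Description>"


def parses(xml_string):
    """Return True if the string is well-formed XML."""
    try:
        ET.fromstring(xml_string)
        return True
    except ET.ParseError:
        return False


sample = "<p>5 &amp; 6 ]]> done</p>"  # made-up description text
print(parses(wrap_cdata(sample)))     # the CDATA section ends too early
print(parses(wrap_escaped(sample)))   # escaped text stays well-formed
print(ET.fromstring(wrap_escaped(sample)).text == sample)
```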

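In fact, the ebaysdk module appears to support an escape_xml flag, which does what the code above does, completely transparently. I think you should use that instead: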

api = Trading(escape_xml=True, debug=args.debug, siteid=site_id, appid=app_id, token=token_id, config_file=None, certid=cert_id, devid=dev_id)

api.execute('ReviseFixedPriceItem', {
    "Item": {
        "Country": "GB",
        "Description": item_html,
        "ItemID": item_to_revise
    }
})

In my tests, all characters looked fine at every stage.
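
A check along those lines might look like the sketch below. It is not the test the answer actually ran; the `round_trip` helper and the sample text are assumptions made for illustration. It relies on the built-in open accepting an encoding argument in Python 3, as in the question's own code, and confirms that non-ASCII characters survive being written and read back as UTF-8.

```python
# Sketch: check that text survives a UTF-8 write/read round trip.
import os
import tempfile


def round_trip(text, encoding="utf-8"):
    """Write text to a temporary file and read it back."""
    fd, path = tempfile.mkstemp(suffix=".html")
    os.close(fd)
    try:
        with open(path, "w", encoding=encoding) as f:
            f.write(text)
        with open(path, "r", encoding=encoding) as f:
            return f.read()
    finally:
        os.remove(path)


sample = "<p>Fast delivery \u2014 50\u00a2 per item</p>"  # made-up text
print(round_trip(sample) == sample)
```

If the comparison fails, the text was most likely already damaged before it reached the file, which points back at how the response was decoded.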

Answer 2

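import re

# Open for reading; a file opened in 'a' (append) mode cannot be read.
f = open(new_file + '.html', 'r', encoding='utf-8')
x = f.read()
f.close()

# re.sub returns a new string, so the result must be assigned back to x.
x = re.sub(r'\\u2014', '-', x)
x = re.sub(r'\xc3\xa2\xc2\x80\xc2', '-', x)
x = re.sub(r'\xe3\xa2\xe2\x80\xe2', '-', x)
print(x)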
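
If the goal is only to swap a real long dash for a plain hyphen, a regular expression is not required. The sketch below uses `str.translate`; the mapping and the function name `asciify_dashes` are illustrative assumptions, and this only treats the symptom rather than the decoding problem discussed in Answer 1.

```python
# Sketch: replace dash-like characters with ASCII hyphens.
# The mapping is an assumption; extend it as needed.
DASHES = {
    0x2013: "-",  # en dash
    0x2014: "-",  # em dash
}


def asciify_dashes(text):
    """Return text with en and em dashes replaced by '-'."""
    return text.translate(DASHES)


print(asciify_dashes("new \u2014 boxed"))
```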


2 answers

Answer 1: score 2, accepted, 2017-05-29 12:55:55

Answer 2: score -1, 2017-05-29 10:20:15
