
Strange characters when writing html file open(file.html, 'a', encoding='utf-8') python

I am using Jinja and BS4 to scrape HTML and paste it into a new file, and everything works fine.

I modified it for another client and it threw an encoding error. I added encoding='utf-8' and it worked again.

However, some of the text is now gobbledygook. In all honesty I don't know what it is, but it's not ASCII.

Example characters: "–" appears instead of a long dash.

The other version of the script works without the explicit encoding and produces zero strange characters.

The full script is here: git

The offending item is line 133.

Open the HTML:

f = open(new_file + '.html', 'a', encoding='utf-8')
message = result
f.write(message)  # write result
f.close()         # close html

*Please note I have not pushed the new version with the encoding.

I'm reading it with BeautifulSoup from a URL using requests:

r = requests.get(ebay_url)
html_bytes = r.content

html_string = html_bytes.decode('UTF-8')

soup = bs4(html_string, 'html.parser')
description_source = soup.find("div", {"class":"dv3"})

Use .text (not .content!) to get the decoded response content from the requests module. This way you don't have to manually decode the response: requests automatically picks the proper encoding by looking at the HTTP response headers.

import codecs
import requests
from bs4 import BeautifulSoup as bs4

def get_ebay_item_html(item_id):
    ebay_url = 'http://vi.vipr.ebaydesc.com/ws/eBayISAPI.dll?ViewItemDescV4&item='
    r = requests.get(ebay_url + str(item_id))

    return r.text
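As a minimal offline sketch of the difference between .content and .text (the Response object here is constructed by hand purely for illustration; a real request fills these fields from the network):

```python
import requests

# Build a Response by hand to illustrate .content (bytes) vs .text (str).
resp = requests.models.Response()
resp._content = 'em dash \u2014 intact'.encode('utf-8')  # raw body bytes
resp.encoding = 'utf-8'  # normally derived from the Content-Type header

print(type(resp.content))  # <class 'bytes'> - the undecoded payload
print(resp.text)           # decoded str, using resp.encoding
```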

Now we can retrieve an item:

item_to_revise = "271796888350"
item_html = get_ebay_item_html(item_to_revise)

...scrape data from it:

soup = bs4(item_html , 'html.parser')
dv3 = soup.find("div", {"class":"dv3"})
print(dv3)

...save it to a file:

with codecs.open("271796888350.html", "w", encoding="UTF-8") as f:
    f.write(item_html)

...load it from a file:

with codecs.open("271796888350.html", "r", encoding="UTF-8") as f:
    item_html = f.read()
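A side note (mine, not from the original answer): on Python 3 the built-in open() accepts an encoding argument, so the codecs module is not needed for this round trip:

```python
# Python 3: built-in open() handles text encoding directly.
text = "long dash \u2014 intact"

with open("271796888350.html", "w", encoding="utf-8") as f:
    f.write(text)

with open("271796888350.html", "r", encoding="utf-8") as f:
    assert f.read() == text  # round-trips without mojibake
```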

...or send it to the ebaysdk module. For this I strongly recommend against constructions like "<![CDATA["+ f.read() + "]]>". CDATA cannot reliably be built this way. Use a proper XML escaping function instead; it's safer.

from xml.sax.saxutils import escape
from ebaysdk.trading import Connection as Trading

api = Trading(debug=args.debug, siteid=site_id, appid=app_id, token=token_id, config_file=None, certid=cert_id, devid=dev_id)

api.execute('ReviseFixedPriceItem', {
    "Item": {
        "Country": "GB",
        "Description": escape(item_html),
        "ItemID": item_to_revise
    }
})

In fact, the ebaysdk module appears to support an escape_xml flag, which transparently does exactly what the code above does. I think you should use that instead:

api = Trading(escape_xml=True, debug=args.debug, siteid=site_id, appid=app_id, token=token_id, config_file=None, certid=cert_id, devid=dev_id)

api.execute('ReviseFixedPriceItem', {
    "Item": {
        "Country": "GB",
        "Description": item_html,
        "ItemID": item_to_revise
    }
})

In my tests all characters looked fine at every point.

import re

# Open in read mode ('a' mode cannot be read from) and decode as UTF-8.
with open(new_file + '.html', 'r', encoding='utf-8') as f:
    x = f.read()

# re.sub returns a new string, so the result must be assigned back.
x = re.sub('\u2014', '-', x)                # em dash
x = re.sub('\xc3\xa2\xc2\x80\xc2', '-', x)  # mojibake byte sequence
x = re.sub('\xe3\xa2\xe2\x80\xe2', '-', x)  # mojibake byte sequence
print(x)
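A more robust alternative (my sketch, not part of the original answer) is to reverse the bad decode instead of substituting individual sequences: re-encode the garbled text as cp1252 and decode it as UTF-8. This assumes the garbling happened exactly once via a cp1252 (or latin-1) misread of UTF-8 bytes:

```python
def fix_mojibake(text):
    """Reverse a single UTF-8-read-as-cp1252 round trip."""
    try:
        return text.encode('cp1252').decode('utf-8')
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text  # already clean, or garbled some other way

# An en dash (U+2013) mis-decoded as cp1252 becomes three characters:
garbled = '\u2013'.encode('utf-8').decode('cp1252')
print(garbled)                # three cp1252 characters instead of a dash
print(fix_mojibake(garbled))  # the original en dash, restored
```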
