
Strange characters when writing an HTML file with open(file.html, 'a', encoding='utf-8') in Python

I am using Jinja and BS4 to scrape HTML and paste it into a new file. Everything works fine.

I modified the script for another client and it threw an encoding error. I added encoding='utf-8' and it worked again.

However, some of the text is now gobbledygook. In all honesty I don't know what it is, but it's not ASCII.

Example characters: – appears where a long dash should be.

With the other version of the script, it works without specifying an encoding and produces no strange characters.

The full script is here: git

The offending item is line 133:

# open the html
f = open(new_file + '.html', 'a', encoding='utf-8')
message = result
f.write(message)  # write result
f.close()  # close html

*Please note I have not pushed the new version with the encoding yet.

I'm reading the page from a URL using requests and parsing it with BeautifulSoup:

r = requests.get(ebay_url)
html_bytes = r.content

html_string = html_bytes.decode('UTF-8')

soup = bs4(html_string, 'html.parser')
description_source = soup.find("div", {"class":"dv3"})

Use .text (not .content!) to get the decoded response content from the requests module. This way you don't have to decode the response manually: the requests module picks the encoding by looking at the HTTP response headers.

import codecs
import requests
from bs4 import BeautifulSoup as bs4

def get_ebay_item_html(item_id):
    ebay_url = 'http://vi.vipr.ebaydesc.com/ws/eBayISAPI.dll?ViewItemDescV4&item='
    r = requests.get(ebay_url + str(item_id))

    return r.text
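As background, here is a minimal sketch of one common way characters like the ones in the question can arise: UTF-8 bytes being interpreted as Windows-1252 somewhere along the way. The sample text is made up for illustration, and this is not necessarily what happened in the asker's script.

```python
# Hypothetical sample text containing an en dash (U+2013).
original = "10 \u2013 20 cm"

# The bytes a server would send if the page is encoded as UTF-8.
raw_bytes = original.encode("utf-8")

# Decoding with the matching codec restores the text exactly.
decoded_ok = raw_bytes.decode("utf-8")

# Decoding the same bytes as Windows-1252 turns the single dash
# into three unrelated characters (the "strange characters").
decoded_wrong = raw_bytes.decode("cp1252")

print(decoded_ok == original)
print(ascii(decoded_wrong))
```

The same mismatch can also happen at display time, for example when a correctly written UTF-8 file is opened by a program that assumes a different encoding.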

Now we can retrieve an item:

item_to_revise = "271796888350"
item_html = get_ebay_item_html(item_to_revise)

...scrape data from it:

soup = bs4(item_html, 'html.parser')
dv3 = soup.find("div", {"class": "dv3"})
print(dv3)

...save it to a file:

with codecs.open("271796888350.html", "w", encoding="UTF-8") as f:
    f.write(item_html)

...load it from a file:

with codecs.open("271796888350.html", "r", encoding="UTF-8") as f:
    item_html = f.read()
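The save and load steps can be checked with a small round trip. This sketch assumes Python 3, where the built-in open() accepts an encoding argument just like codecs.open(); the sample HTML and the temporary file name are placeholders, not part of the original script.

```python
import os
import tempfile

# Hypothetical snippet standing in for the scraped item HTML.
item_html = '<div class="dv3">Size: 10 \u2013 20 cm</div>'

# Use a temporary directory so the sketch leaves no files behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "item.html")

    # Write with an explicit encoding...
    with open(path, "w", encoding="utf-8") as f:
        f.write(item_html)

    # ...and read back with the same explicit encoding.
    with open(path, "r", encoding="utf-8") as f:
        loaded_html = f.read()

    # Inspect the raw bytes that actually ended up on disk.
    with open(path, "rb") as f:
        raw = f.read()

print(loaded_html == item_html)
print(b"\xe2\x80\x93" in raw)  # UTF-8 byte sequence of the en dash
```

If the text survives this round trip but still looks wrong in a viewer, the file itself is probably fine and the viewer is guessing a different encoding.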

...or send it to the ebaysdk module. For this I strongly advise against constructions like this one: "<![CDATA[" + f.read() + "]]>". CDATA cannot reliably be built this way, because the content itself may contain the "]]>" terminator. Use a proper XML escaping function instead; it's safer.

from xml.sax.saxutils import escape
from ebaysdk.trading import Connection as Trading

api = Trading(debug=args.debug, siteid=site_id, appid=app_id, token=token_id, config_file=None, certid=cert_id, devid=dev_id)

api.execute('ReviseFixedPriceItem', {
    "Item": {
        "Country": "GB",
        "Description": escape(item_html),
        "ItemID": item_to_revise
    }
})
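To illustrate why string-built CDATA is fragile, here is a small sketch using only the standard library. The description text is invented for the example; the point is only that a "]]>" inside the content closes the CDATA section early, while escape() leaves no markup characters behind.

```python
from xml.sax.saxutils import escape

# Hypothetical description that happens to contain the CDATA terminator.
item_html = '<p>Fish & chips ]]> sold here</p>'

# Naive wrapping: the embedded "]]>" ends the CDATA section early,
# so everything after it would be parsed as ordinary XML.
naive = "<![CDATA[" + item_html + "]]>"
first_close = naive.index("]]>")
closes_early = first_close < len(naive) - len("]]>")

# escape() replaces &, < and > with entities, so the text can be
# placed in an XML element without any CDATA wrapper.
escaped = escape(item_html)

print(closes_early)
print(escaped)
```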

In fact, the ebaysdk module appears to support an escape_xml flag, which transparently does what the explicit escape() call does. I think you should use that instead:

api = Trading(escape_xml=True, debug=args.debug, siteid=site_id, appid=app_id, token=token_id, config_file=None, certid=cert_id, devid=dev_id)

api.execute('ReviseFixedPriceItem', {
    "Item": {
        "Country": "GB",
        "Description": item_html,
        "ItemID": item_to_revise
    }
})

In my tests all characters looked fine at all points.

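import re

# Open for reading; mode 'a' is append-only and cannot be read from
with open(new_file + '.html', 'r', encoding='utf-8') as f:
    x = f.read()

# re.sub returns a new string, so keep the result
x = re.sub(r'\\u2014', '-', x)  # the literal text "\u2014"
x = re.sub(r'\xc3\xa2\xc2\x80\xc2', '-', x)  # mis-decoded dash sequences
x = re.sub(r'\xe3\xa2\xe2\x80\xe2', '-', x)
print(x)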
