
BeautifulSoup output encoding: how to combine soup.p.encode("utf-8") with soup.select('a') and .getText()

About Python BeautifulSoup output encoding (using Python 3.4.4):

How to combine soup.p.encode("utf-8") with soup.select('a') & .getText()?

That is, I can do either of the two but do not know how to do both. I want to use soup.p.encode("utf-8") because otherwise, for example, "Aloë" is transformed to "aloë" in my output.

But I also want to use the soup object (type: <class 'bs4.BeautifulSoup'>) to select the links via soup.select('a') and read their text via .getText(). If I call soup.p.encode("utf-8") first, this is not possible because I get "AttributeError: 'bytes' object has no attribute 'select'".

But it seems that once I have turned the soup object into a list and then into strings, it is too late to get the UTF-8 characters back? For example, text = text.decode('utf-8') does not work. I could really use some advice, please!
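Both errors described above come from Python 3's split between text (str) and encoded data (bytes). The following is a minimal sketch using only the standard library, with "Aloë" as sample data; the variable names are illustrative and not taken from the question's code.

```python
# "Aloë" as Python 3 text; every str is already Unicode.
text = "Alo\u00eb"

# encode() turns text into bytes. Bytes are raw data, so they have
# no BeautifulSoup methods such as .select() or .getText().
raw = text.encode("utf-8")

# decode() exists only on bytes and reverses the step above.
# A str has no decode(), which is why text.decode('utf-8') fails.
round_trip = raw.decode("utf-8")

# Decoding UTF-8 bytes with the wrong codec yields garbled characters.
garbled = raw.decode("latin-1")

print(type(raw).__name__, hasattr(text, "decode"), round_trip == text)
```

The practical consequence is to keep working with str (and with the soup object) for as long as possible and to encode only when the data leaves the program.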

FYI my code:

import requests, bs4

res = requests.get(url)
try:
    res.raise_for_status()
except Exception as exc:
    print('There was a problem: %s' % (exc))

soup = bs4.BeautifulSoup(res.text, "html.parser", from_encoding="UTF-8")
#soup = soup.encode("utf-8")
#type: <class 'bs4.BeautifulSoup'>
#print(soup.original_encoding) -> None...
aElems = soup.select('a')
#type: <class 'list'>
lengthElems = len(aElems)

for i in range(0, lengthElems):
    text = aElems[i].getText()
    #text = text.decode('utf-8')
    link = aElems[i].get('href')

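On Python 2, you can just set the default encoding to UTF-8 within the script. This does not carry over to Python 3 (the question uses 3.4.4), where reload() is not a builtin and sys.setdefaultencoding() does not exist; there, str is already Unicode.

At the top of your file you want this:

import sys

if __name__ == "__main__":
    # Python 2 only: reload(sys) restores setdefaultencoding(),
    # which is removed from sys at interpreter startup.
    if sys.version_info[0] == 2:
        reload(sys)
        sys.setdefaultencoding("utf-8")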
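Since the question targets Python 3.4.4, here is a sketch of how the same goal might be reached there without touching the default encoding. It assumes the bs4 package is installed; html_bytes is a hypothetical stand-in for res.content (the raw bytes of the response), and the sample markup is invented for illustration.

```python
import bs4

# Hypothetical stand-in for res.content, i.e. the undecoded response body.
html_bytes = '<p>Alo\u00eb</p><a href="/aloe">Alo\u00eb vera</a>'.encode("utf-8")

# from_encoding applies to bytes input; it is ignored when the markup
# is already a str (such as res.text).
soup = bs4.BeautifulSoup(html_bytes, "html.parser", from_encoding="utf-8")

# Work on the soup object while everything is still str.
rows = []
for a in soup.select("a"):
    text = a.get_text()   # str, non-ASCII characters intact
    link = a.get("href")
    rows.append((text, link))

# Encode only at the output boundary, after selecting is done.
encoded = soup.p.encode("utf-8")
print(rows, encoded)
```

If the output still looks garbled when printed, the terminal or file encoding may be the cause rather than BeautifulSoup; writing to a file opened with encoding="utf-8" is one way to check.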

