About Python BeautifulSoup output encoding (using Python 3.4.4):
How can I combine soup.p.encode("utf-8") with soup.select('a') and .getText()?
I.e., I can do one of the two but do not know how to do both. I want to use soup.p.encode("utf-8") because otherwise e.g. "Aloë" is transformed to "aloë" in my output.
But I also want to use the soup object (type: <class 'bs4.BeautifulSoup'>) to select the link elements via soup.select('a') and read their text via .getText(). If I call soup.p.encode("utf-8") first, this is not possible because I get "AttributeError: 'bytes' object has no attribute 'select'".
But it seems that once I have turned the soup object into a list and then a string, it is too late to get the UTF-8 characters back? E.g. text = text.decode('utf-8') does not work. I could really use some advice, please!
FYI, my code:
import requests, bs4
res = requests.get(url)
try:
    res.raise_for_status()
except Exception as exc:
    print('There was a problem: %s' % (exc))
soup = bs4.BeautifulSoup(res.text, "html.parser", from_encoding="UTF-8")
#soup = soup.encode("utf-8")
#type: <class 'bs4.BeautifulSoup'>
#print(soup.original_encoding) -> None...
aElems = soup.select('a')
#type: <class 'list'>
lengthElems = len(aElems)
for i in range(0, lengthElems):
    text = aElems[i].getText()
    #text = text.decode('utf-8')
    link = aElems[i].get('href')
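Both errors described in the question (no select on the encoded result, no decode on the extracted text) appear to come down to Python 3's split between str and bytes. The following is a minimal standard-library sketch of that split, using the word from the question. The variable names are illustrative, and the latin-1 step is only an assumption about how this kind of garbling commonly arises, not a diagnosis of the page in question.

```python
# str: a sequence of Unicode code points (what .getText() returns in Python 3)
word = "Aloë"

# Encoding produces bytes; bytes know nothing about HTML, so they have
# no .select() method. This mirrors the AttributeError in the question.
encoded = word.encode("utf-8")
has_select = hasattr(encoded, "select")

# A Python 3 str has no .decode(); only bytes can be decoded.
str_has_decode = hasattr(word, "decode")

# Decoding with the codec that was used for encoding restores the text.
round_trip = encoded.decode("utf-8")

# Decoding the same bytes with a different codec yields mojibake.
garbled = encoded.decode("latin-1")
```

The practical consequence is to keep everything as str while selecting and extracting, and to encode only at the point where the text leaves the program (for example when writing a file opened with encoding="utf-8").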
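In Python 2, you can set the default encoding to UTF-8 within the script. This does not carry over to Python 3 (including the 3.4.4 used in the question): there, reload is no longer a built-in and sys.setdefaultencoding does not exist, so the snippet fails with a NameError.
For Python 2, at the top of your file you want this:
import sys
if __name__ == "__main__":
    reload(sys)
    sys.setdefaultencoding("utf-8")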
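On Python 3, as used in the question, text is already Unicode, so a different route may be more useful. Below is a minimal sketch, assuming the page really is UTF-8: BeautifulSoup is given the raw bytes (res.content in requests, rather than the already-decoded res.text), because from_encoding only has an effect on bytes input, and the soup object is kept intact for select(). The sample HTML, the paths in it and the extract_links helper are hypothetical and stand in for the real page.

```python
import bs4

# Hypothetical sample page as raw UTF-8 bytes. With requests, the
# equivalent would be res.content (bytes), not res.text (str).
SAMPLE_HTML = (
    '<p><a href="/plants/aloe">Aloë</a> '
    '<a href="/plants/agave">Agave</a></p>'
).encode("utf-8")


def extract_links(html_bytes):
    """Return (text, href) pairs for every <a> element in the markup."""
    # from_encoding is honoured here because the markup is bytes.
    soup = bs4.BeautifulSoup(html_bytes, "html.parser", from_encoding="utf-8")
    pairs = []
    for a in soup.select("a"):
        text = a.get_text()   # already a str; no .decode() needed
        href = a.get("href")  # None if the attribute is missing
        pairs.append((text, href))
    return pairs


links = extract_links(SAMPLE_HTML)
```

With this arrangement the soup is never encoded, so select() and get_text() stay available, and the non-ASCII characters survive because the bytes are decoded once with the right codec. If the console cannot display them, that is an output-encoding matter and can be handled separately when printing or writing the results.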