How to deal with invalid Unicode in BeautifulSoup without converting to UTF-8?

I am trying to parse a website's HTML with Python, and one troublesome character, \u011f (ğ), gives the following error:

Function call:   soup = BeautifulSoup(response, "html.parser")
                 print (soup)

Error: UnicodeEncodeError: 'charmap' codec can't encode character '\u011f'

If I encode to utf-8 instead,

soup = BeautifulSoup(response, "html.parser").encode('utf-8') 

That removes the error, but I cannot use it because I call the find function later, and find needs the unicode soup rather than its encoded form. If I call find after encoding to utf-8, I receive the following error:

Function call:   worksTable = soup.find('tbody', attrs={'id': 'some_id'})
Error: TypeError: find() takes no keyword arguments

I have already spent hours on this code and could not find any answers here that fit my case. Any help would be appreciated.

When you encode the soup, it is no longer a BeautifulSoup object; it becomes a byte string (bytes in Python 3).

The following line of code

BeautifulSoup(response, "html.parser").encode('utf-8')

returns a byte string, which does not support the find(tagname, attrs={}) call; that call belongs on a BeautifulSoup object.
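The TypeError from the question can be reproduced without BeautifulSoup at all. The sketch below is only an illustration: a plain str named html (a hypothetical stand-in for the soup's markup) is encoded the same way, and the result's own find method is called with the keyword argument from the question.

```python
# Hypothetical stand-in for the markup held by the soup; not from the original site.
html = '<tbody id="some_id">ya\u011fmur</tbody>'

# Encoding turns text into bytes, just as calling .encode('utf-8') on a soup does.
encoded = html.encode("utf-8")
result_type = type(encoded).__name__

# bytes.find is a substring search and accepts no keyword arguments,
# so the BeautifulSoup-style call fails before any searching happens.
try:
    encoded.find("tbody", attrs={"id": "some_id"})
    error_message = None
except TypeError as exc:
    error_message = str(exc)

print(result_type)
print(error_message)
```

The point is that the object being searched has changed type, so keeping the soup as a BeautifulSoup object and encoding only at output time avoids the error.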

I think you should encode the response text before making the soup to get a better result.

responseTxt = response.text.encode('UTF-8')
soup = BeautifulSoup(responseTxt, 'html.parser')
idv = soup.find('tbody', attrs={'id': 'some_id'})
print(idv.text)

So I found out that it was a problem with my desktop's codec. The same code runs fine on my laptop. I'm very confused about this, but I will find a way to manage.
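A machine-specific failure like this is consistent with the output stream's encoding, not the soup, being the culprit. The sketch below simulates that under stated assumptions: make_console is a hypothetical helper, and cp1252 is assumed as the desktop's console encoding (the question does not say which code page was in use). It shows print failing on a strict stream and succeeding on one that replaces unencodable characters.

```python
import io

text = "ya\u011fmur"  # contains U+011F, the character from the error message


def make_console(errors):
    """Hypothetical helper: an in-memory text stream that encodes to cp1252."""
    return io.TextIOWrapper(
        io.BytesIO(), encoding="cp1252", errors=errors, newline="\n"
    )


# A strict cp1252 stream cannot represent U+011F, so print raises.
strict = make_console("strict")
try:
    print(text, file=strict)
    strict.flush()
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# With errors="replace", the same print succeeds and substitutes "?".
tolerant = make_console("replace")
print(text, file=tolerant)
tolerant.flush()
tolerant_output = tolerant.buffer.getvalue()
```

On a real console, a comparable effect may be available through sys.stdout.reconfigure(errors="replace") on Python 3.7 or later, or through the PYTHONIOENCODING environment variable; either way the soup stays a BeautifulSoup object and find keeps working.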

You can try calling encode() on the result of find() rather than on the soup. Here is an example:

worksTable = soup.find('tbody', attrs={'id': 'some_id'}).text.encode('utf-8')
