How to remove certain utf-8 characters from a string?
In my case I want to remove specifically the „ and the ” characters from a string. I use BeautifulSoup to parse certain html paragraphs, and get a substring from them. So far my code looks like this:
# -*- coding: cp1252 -*-
from bs4 import BeautifulSoup as bs
import re
soup = bs(open("file.xhtml"), 'html.parser')
for tag in soup.find_all('p', {"class": "fnp2"}) :
    line = unicode(str(tag).split(':')[0], "utf-8")
    line = re.sub('(<p class="fnp2">)(\d+) ', '', line)
    line = line.replace('„', '')
    print line
But for that, I always receive a UnicodeDecodeError:
    line = line.replace('„', '')
UnicodeDecodeError: 'ascii' codec can't decode byte 0x84 in position 0: ordinal not in range(128)
What would be a solution for this?
The line variable in your code is a unicode object. When you call line.replace, Python expects the first argument to also be a unicode object. If you provide a str object instead, Python will try to automatically decode it into a unicode string using the system default encoding (which you can check via sys.getdefaultencoding()).
Apparently, the system encoding is ascii in your case. The byte string '„' cannot be decoded using the ascii codec, because '„' is not an ASCII symbol, which causes the exception that you see.
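To make the failure concrete, here is a minimal sketch of the decoding step that Python attempts behind the scenes. It assumes the source file really is CP1252 (as the coding line suggests), in which case the „ literal is stored as the single byte 0x84 from the traceback. The variable names are only illustrative, and the snippet is written so that it should run on both Python 2.7 and Python 3.

```python
from __future__ import print_function

# 0x84 is the byte named in the traceback; under CP1252 it encodes
# the low double quotation mark (U+201E).
raw = b'\x84'

# Decoding with the codec that matches the source file succeeds.
decoded = raw.decode('cp1252')
print(decoded == u'\u201e')

# Decoding with ascii fails, because ascii only covers bytes 0x00-0x7F.
try:
    raw.decode('ascii')
    error_message = None
except UnicodeDecodeError as exc:
    error_message = str(exc)
print(error_message)
```

The second decode is roughly what Python 2 does implicitly when a str argument meets a unicode object, which is why the error points at the replace call.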
You could fix the problem by changing the default system encoding to the same one you used to provide the '„' string (CP1252, I guess); however, such a fix is only interesting from an academic point of view, as it just sweeps the issue under the carpet.
A proper, safe and easy fix to your problem would be to simply provide a unicode object to the replace method in the first place. This would be as simple as replacing '„' with u'„' in your code.
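As a rough illustration of that fix, the sketch below removes both characters from the question using unicode literals only. The helper name strip_quotes and the sample text are made up for this example. The characters are written as escapes (u'\u201e' for „ and u'\u201d' for ”) so the snippet does not depend on the encoding of the source file; writing u'„' directly should behave the same way as long as the coding declaration matches the file.

```python
from __future__ import print_function

def strip_quotes(text):
    """Remove U+201E and U+201D from a unicode string (hypothetical helper)."""
    # Both arguments to replace() are unicode objects, so Python never
    # has to fall back on the default (ascii) codec.
    for ch in (u'\u201e', u'\u201d'):
        text = text.replace(ch, u'')
    return text

# Made-up sample text wrapped in the two quotation marks.
sample = u'\u201eHallo Welt\u201d'
cleaned = strip_quotes(sample)
print(cleaned == u'Hallo Welt')
```

In the loop from the question, this would amount to calling the helper on line once it has been converted to unicode.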