How to remove certain utf-8 characters from a string?

In my case I want to remove two specific characters, one of which is „, from a string. I use BeautifulSoup to parse certain HTML paragraphs and get a substring from them. So far my code looks like this:

# -*- coding: cp1252 -*-
from bs4 import BeautifulSoup as bs
import re

soup = bs(open("file.xhtml"), 'html.parser')

for tag in soup.find_all('p', {"class": "fnp2"}) :
    line = unicode(str(tag).split(':')[0], "utf-8")
    line = re.sub('(<p class="fnp2">)(\d+) ', '', line)
    line = line.replace('„', '')
    print line

But with that, I always get a UnicodeDecodeError:

line = line.replace('„', '')

UnicodeDecodeError: 'ascii' codec can't decode byte 0x84 in position 0: ordinal not in range(128)

What would be a solution for this?

The line variable in your code is a unicode object. When you call line.replace, Python expects the first argument to also be a unicode object. If you provide a str object instead, Python will try to automatically decode it into a unicode string using the system default encoding (which you can check via sys.getdefaultencoding()).

Apparently, the system encoding is ascii in your case. The byte string '„' cannot be decoded using the ascii codec, because '„' is not an ASCII symbol, which causes the exception that you see.
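The failure can be reproduced in isolation. The sketch below assumes the source file really is saved as CP1252, as the coding line in the question declares, so that '„' is stored as the single byte 0x84 named in the traceback. The explicit decode calls stand in for the implicit conversion that Python 2 attempts; the variable names are illustrative only.

```python
# -*- coding: utf-8 -*-
# The byte from the traceback. In CP1252, 0x84 encodes the character U+201E.
raw = b'\x84'

# Decoding with the codec that matches the source file succeeds.
decoded = raw.decode('cp1252')

# Decoding with the ascii codec fails, because ASCII only covers bytes 0x00-0x7F.
try:
    raw.decode('ascii')
    ascii_failed = False
    error_text = ''
except UnicodeDecodeError as exc:
    ascii_failed = True
    error_text = str(exc)
```

Here error_text should contain the same "ordinal not in range(128)" message that the question reports.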

You could fix the problem by changing the default system encoding to the same one you used to provide the '„' string (CP1252, I guess). However, such a fix is only of academic interest, as it just sweeps the issue under the carpet.

A proper, safe and easy fix would be to simply provide a unicode object to the replace method in the first place. This is as simple as replacing '„' with u'„' in your code.
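As a rough sketch of that fix, the loop body from the question could look like the following. The sample string is hypothetical and stands in for one parsed paragraph tag, so the snippet runs without BeautifulSoup or the original file. The character is written as the escape \u201e, which is the same as u'„' but does not depend on how the source file is encoded. The u prefix is accepted by Python 2.7 and by Python 3.3 and later.

```python
# -*- coding: utf-8 -*-
import re

# Hypothetical sample standing in for one parsed tag; not from the original file.
sample = u'<p class="fnp2">12 \u201eExample: rest of the footnote</p>'

# Keep everything before the first colon, as in the question.
line = sample.split(u':')[0]

# Drop the opening tag and the leading footnote number.
line = re.sub(u'(<p class="fnp2">)(\\d+) ', u'', line)

# Pass a unicode object to replace, so no implicit ascii decoding takes place.
line = line.replace(u'\u201e', u'')
```

Because every operand is a unicode object, Python never has to fall back on the system default encoding.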
