How to: remove the part of a Unicode string after a special character in Python

Question

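First, a short summary:

Python version: 3.1; system: Linux (Ubuntu)

I am trying to retrieve some data with Python and BeautifulSoup.

Unfortunately, some of the tables I need to process contain cells with text strings like this: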

789.82±10.28

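So I need to know two things:

How do I handle "weird" symbols such as ±, and how do I remove the part of the string that contains ± and everything to the right of it?

At the moment I get an error like: SyntaxError: Non-ASCII character '\xc2' in file ......

Thank you for your help

[Edit]: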

# dataretriveal from html files from DETHERM
# -*- coding: utf8 -*-

import sys,os,re
from BeautifulSoup import BeautifulSoup


sys.path.insert(0, os.getcwd())

raw_data = open('download.php.html','r')
soup = BeautifulSoup(raw_data)


for numdiv in soup.findAll('div', {"id" : "sec"}):
    currenttable = numdiv.find('table',{"class" : "data"})
    if currenttable:
        numrow=0
        for row in currenttable.findAll('td', {"class" : "dataHead"}):
            numrow=numrow+1

        for col in currenttable.findAll('td'):
            col2 = ''.join(col.findAll(text=True))
            if col2.index('±'):
                col2=col2[:col2.index('±')]
            print(col)
        print(numrow)
        ref=numdiv.find('a')
        niceref=''.join(ref.findAll(text=True))
        print(niceref)

This code now fails with an error:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 0: ordinal not in range(128)

Where does the reference to ASCII come from?

Answer 1

You need to have your Python file encoded as UTF-8. Beyond that, it is quite simple:

>>> s = '789.82 ± 10.28'
>>> s[:s.index('±')]
'789.82 '
>>> s.partition('±')
('789.82 ', '±', ' 10.28')
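The same idea can be wrapped in a small helper. This is only a sketch, assuming Python 3 and assuming the HTML file is UTF-8 encoded. The function name `strip_uncertainty` and the sample bytes are made up for illustration and are not part of the original code.

```python
PLUS_MINUS = '\u00b1'  # U+00B1 PLUS-MINUS SIGN, written as an escape so the
                       # source file itself stays pure ASCII


def strip_uncertainty(cell_text, marker=PLUS_MINUS):
    """Return the part of cell_text before marker, with whitespace trimmed.

    If marker does not occur, the whole (trimmed) text is returned.
    """
    value, _sep, _rest = cell_text.partition(marker)
    return value.strip()


# Bytes as they might arrive from a UTF-8 encoded file (assumed encoding):
# 0xc2 0xb1 is the UTF-8 encoding of the plus-minus sign.
raw = b'789.82\xc2\xb110.28'
text = raw.decode('utf-8')  # decode explicitly instead of relying on a default

print(strip_uncertainty(text))
print(strip_uncertainty('42.0'))
```

`partition` is used here rather than `index` because `index` raises `ValueError` when the marker is missing, whereas `partition` simply returns the whole string as the first element. A test such as `if col2.index('±'):` is also falsy when the marker sits at position 0, so `if '±' in col2:` is the safer check.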

Solution 1 (accepted, 0 votes), 2010-10-07 17:02:35