
Finding and replacing bad characters in pandas dataframe

I'm stuck trying to get rid of bad characters in a pandas dataframe. This is an automated script that processes incoming data that needs to be saved in cp1252, and I want to be able to handle any problem characters on the fly by parsing the error. I don't care what they are replaced with. I've tried a million variations on this and can't get anywhere (this is Python 3, pandas 0.25).

while True:
    try:
        print('saving')
        data.to_csv('total.csv', index=False, quoting=csv.QUOTE_ALL, encoding='cp1252')
        break
    except UnicodeEncodeError as e:
        print(e)
        badchar = re.search(r"character (.+?) in", str(e)).group(1)
        print('Found bad character, removing. . . ')
        uchar = u"{}".format(badchar)
        print(uchar)
        data = data.replace(uchar.encode('utf-8'), '')

Returns:

saving
'charmap' codec can't encode character '\u2264' in position 399: character maps to <undefined>
Found bad character, removing. . . 
'\u2264'
saving
'charmap' codec can't encode character '\u2264' in position 399: character maps to <undefined>
Found bad character, removing. . . 
'\u2264'
saving
'charmap' codec can't encode character '\u2264' in position 399: character maps to <undefined>
Found bad character, removing. . . 
'\u2264'
saving

I've tried a ton of variations:

data = data.replace(uchar, '')

data = data.replace(uchar.encode('utf-8').decode('utf-8'), '')

etc.

I also tried u'\\2264' and u'u\\2264'.

I can't find this in the dataframe either. This returns nothing:

for col in data:
    if sum(data[col].astype(str).str.contains(u'\2264')) > 0:
        print(col)

Any help would be appreciated, thanks!

You have to use the regex replace functionality: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.replace.html

df.replace(to_replace=r'^ba.$', value='new', regex=True)
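To tie this back to the question, here is a minimal sketch under a few assumptions. The regex on the error message captures the text '\u2264' (quotes and backslash included), not the character itself, so nothing in the data ever matches it. The exception object already carries the offending text in its object, start and end attributes. Also, replace() without regex=True only matches whole cell values. The sample frame and the helper name strip_unencodable are made up for illustration.

```python
import pandas as pd

# Hypothetical sample data; "\u2264" is the character from the error message.
data = pd.DataFrame({"a": ["x \u2264 y", "plain"], "b": [1, 2]})

# 1. Read the offending character from the exception itself instead of
#    parsing str(e), which only contains its escaped repr.
try:
    "x \u2264 y".encode("cp1252")
except UnicodeEncodeError as e:
    bad = e.object[e.start:e.end]

# 2. regex=True makes replace() work on substrings, not only whole cells.
replaced = data.replace(bad, "", regex=True)


def strip_unencodable(df, encoding="cp1252"):
    """Return a copy of df with characters `encoding` cannot represent removed.

    Hypothetical helper, not part of pandas. Non-string values are left alone.
    """
    def clean(value):
        if isinstance(value, str):
            # errors="ignore" silently drops anything the codec cannot encode.
            return value.encode(encoding, errors="ignore").decode(encoding)
        return value

    return df.apply(lambda col: col.map(clean))


# 3. Or clean everything in one pass, so no retry loop is needed.
cleaned = strip_unencodable(data)
```

After this, cleaned.to_csv('total.csv', index=False, encoding='cp1252') should no longer hit the encode error. Newer pandas releases (1.1 and later, as far as I know) also accept an errors argument on to_csv, for example errors='replace', which may make the manual cleanup unnecessary.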
