
Removing specific characters from a pandas dataframe

I have a CSV file in which several values contain junk data that looks like: ‡_¤Ëçéè_Â…

I have imported the file into a pandas dataframe. How can I get rid of these characters? I would like to delete the contents of any cell that contains such characters and put in a flag value instead (something like -99999). The table has mixed data types.

import pandas as pd
import codecs
import unicodedata
import csv
from io import StringIO  # Python 3 location; the standalone StringIO module exists only in Python 2

testData = pd.read_csv('Data.csv', encoding="iso-8859-1", engine='python')

Using encoding utf-8 gives me an error about an invalid start byte, and using the default engine doesn't work either.

Any suggestions?

If you know which characters you are willing to accept, you could use a regex to filter your values, something like:

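# mask() replaces the values where the condition is True, i.e. where the cell
# contains a character outside the accepted set; na=False skips missing values.
testData['stringcol'] = testData['stringcol'].mask(
    testData['stringcol'].str.contains(r'[^A-Za-z0-9\s]', na=False), -999999)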
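
Since the table has mixed data types, the same idea can be applied to every column while leaving numbers and missing values alone. The following is only a minimal sketch: the helper name `flag_junk`, the constants `FLAG` and `NOT_ACCEPTED`, and the choice to test only cells that are actual Python strings are assumptions made for illustration, not part of the original answer.

```python
import re

import pandas as pd

FLAG = -999999  # flag value used in the answer above
NOT_ACCEPTED = re.compile(r'[^A-Za-z0-9\s]')  # any character outside the accepted set


def flag_junk(df, pattern=NOT_ACCEPTED, flag=FLAG):
    """Return a copy of df in which string cells matching pattern are replaced by flag.

    Hypothetical helper: non-string cells (numbers, NaN, None) are never tested.
    """
    out = df.copy()
    for col in out.columns:
        # Work on an object-dtype view so a numeric flag can sit next to strings.
        values = out[col].astype(object)
        junk = values.map(lambda v: isinstance(v, str) and bool(pattern.search(v)))
        if junk.any():
            out[col] = values.mask(junk, flag)
    return out
```

Something like `cleaned = flag_junk(testData)` would then return a flagged copy. Note that the accepted set above also rejects punctuation, so strings such as "1.5" or "a-b" would be flagged; widen the character class if those are legitimate values in your data.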
