Is there a way to remove invalid characters in Excel?

I want to read an Excel file with pandas in Python. My code is as simple as this:

import pandas as pd
data = pd.read_excel(open("excel.xlsx"),encoding='utf-8')

But I get the following error after running the script:

UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 16: character maps to <undefined>

It looks like there is at least one 'invalid' character in my Excel file. I have tried saving the Excel file under a different name, and I have tried some other encodings that were suggested in other SO threads, but nothing resolved the issue. How can I get rid of those characters in my Excel file?

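An .xlsx file is a binary file, but open without a mode argument opens it in text mode, so Python tries to decode its bytes as text before passing the result on to read_excel. That decoding step is what fails, not anything inside the spreadsheet. Instead, use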

data = pd.read_excel("excel.xlsx", encoding='utf-8')

If you do want to use open (which is not needed in this case, as pandas automatically opens the file for you), open the file in binary mode:

data = pd.read_excel(open("excel.xlsx", mode='rb'))
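To see why text mode fails and binary mode does not, here is a minimal sketch that uses only the standard library. It assumes the platform default encoding is cp1252 (common on Windows, and passed explicitly here so the result is the same everywhere). The payload is a synthetic stand-in for the start of a binary file, not a real workbook, and the helper names read_as_text, read_as_bytes and demo are made up for illustration.

```python
import os
import tempfile

# Synthetic stand-in for binary file content (assumption: not a real workbook).
# Byte 0x9d sits at index 16 and has no mapping in cp1252.
payload = b"PK\x03\x04" + b"\x00" * 12 + b"\x9d" + b"rest of archive"


def read_as_text(path):
    """Read in text mode, decoding with cp1252 like a default open() on Windows."""
    with open(path, encoding="cp1252") as fh:
        return fh.read()


def read_as_bytes(path):
    """Read in binary mode: no decoding happens, so no byte can be 'invalid'."""
    with open(path, mode="rb") as fh:
        return fh.read()


def demo():
    """Write the payload to a temp file and try both ways of reading it."""
    fd, path = tempfile.mkstemp(suffix=".xlsx")
    try:
        with os.fdopen(fd, "wb") as fh:
            fh.write(payload)
        try:
            read_as_text(path)
            text_error = None
        except UnicodeDecodeError as exc:
            text_error = exc
        raw = read_as_bytes(path)
    finally:
        os.remove(path)
    return text_error, raw


if __name__ == "__main__":
    error, raw = demo()
    print("text mode:", error)
    print("binary mode read", len(raw), "bytes")
```

The text-mode read raises the same kind of UnicodeDecodeError as in the question, while the binary read returns the bytes untouched, which is what read_excel needs.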

Ori6151 is correct that the encoding needs to be "utf-8"; "utf-8-sig" also works well.

I had to use the encoding "cp850", which stopped this error for me. Of course, it depends on which character cannot be decoded.
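As a rough illustration of why the outcome depends on the byte in question, the sketch below checks whether the single byte 0x9d from the error message can be decoded under a few codecs. The helper name decodable_with is hypothetical. Note that a successful decode only means the codec has a mapping for that byte, not that the resulting text is meaningful, so for an .xlsx file binary mode remains the safer route.

```python
def decodable_with(data, encodings):
    """Return {encoding name: True/False} for whether data decodes without error."""
    results = {}
    for name in encodings:
        try:
            data.decode(name)
            results[name] = True
        except UnicodeDecodeError:
            results[name] = False
    return results


# The byte reported in the question's error message.
results = decodable_with(b"\x9d", ["cp1252", "utf-8", "cp850", "latin-1"])

if __name__ == "__main__":
    for name, ok in results.items():
        print(name, "decodes" if ok else "fails")
```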
