
MS SQL Server Management Studio export to CSV introduces extra character when reading from pandas

I'm using MS SQL Server Management Studio and I have a simple table with the following data:

CountryId     CommonName  FormalName
---------     ----------  ----------
        1    Afghanistan  Islamic State of Afghanistan
        2        Albania  Republic of Albania
        3        Algeria  People's Democratic Republic of Algeria
        4        Andorra  Principality of Andorra

I use "Save Results As" to save this data into countries.csv using the default UTF-8 encoding. Then I go into IPython and read it into a data frame using pandas:

import pandas as pd
df = pd.read_csv("countries.csv")

If I do

df.columns

I get:

Index([u'CountryId', u'CommonName', u'FormalName'], dtype='object')

The weird thing is that when I copy the column names, paste them into a new cell, and press Enter, I get:

u'\ufeffCountryId', u'CommonName', u'FormalName'

A Unicode character, \ufeff, shows up at the beginning of the first column name.

I tried the procedure with different tables and got the extra character every time. It only ever affects the first column name.

Can anyone explain why the extra Unicode character shows up?

Try using the encoding="utf-8-sig" option with read_csv. For example:

df = pd.read_csv("countries.csv", encoding="utf-8-sig")

That should get it to ignore the Unicode byte order mark (BOM) at the start of the CSV file. The BOM is unnecessary here because UTF-8 has no byte-order ambiguity, but Microsoft tools like to use it as a magic number to identify UTF-8 encoded text files.
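
To see where the character comes from without needing the actual export, here is a minimal sketch using only the Python 3 standard library. The sample bytes and the header() helper are made up for illustration; they mimic a file that starts with the UTF-8 BOM. Decoding with "utf-8" leaves the BOM in the text as U+FEFF glued to the first field, while "utf-8-sig" consumes it, which is presumably the same difference you see when passing either encoding to read_csv.

```python
import codecs
import csv
import io

# Hypothetical sample mimicking the exported file: the UTF-8 BOM
# followed by ordinary CSV text.
raw = codecs.BOM_UTF8 + (
    "CountryId,CommonName,FormalName\n"
    "1,Afghanistan,Islamic State of Afghanistan\n"
).encode("utf-8")


def header(data, encoding):
    """Decode the bytes with the given codec and return the CSV header row."""
    text = data.decode(encoding)
    return next(csv.reader(io.StringIO(text)))


plain = header(raw, "utf-8")         # BOM survives as U+FEFF on the first field
stripped = header(raw, "utf-8-sig")  # the codec drops a leading BOM if present

print(raw[:3])            # b'\xef\xbb\xbf', the three BOM bytes
print(repr(plain[0]))     # first column name still carries the BOM
print(repr(stripped[0]))  # first column name is clean
```

Only the first column is affected because the BOM occurs once, at the very start of the file. "utf-8-sig" is also safe for files that have no BOM, since it only strips the mark when it is there.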
