
Pandas read_csv dtype converts column incorrectly

If I have the following CSV

"1"
"2"
"23"

and I read it

import pandas as pd

names = ["nullable"]
dtype = {"nullable": "int32"}
df = pd.read_csv(r"E:\work\nullable.csv",
                 names=names,
                 dtype=dtype,
                 encoding="utf-8")

Looking at df.info():

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 1 columns):
nullable    3 non-null int32
dtypes: int32(1)
memory usage: 140.0 bytes
None

If I add a "" (a NaN) to the CSV and change the dtype to pd.Int32Dtype, df.info() shows an object type.

df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 1 columns):
nullable    3 non-null object
dtypes: object(1)
memory usage: 160.0+ bytes
None
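
For comparison, recent pandas versions (0.24+) document passing the nullable extension dtype directly to read_csv, either as the string alias "Int32" or as an instance pd.Int32Dtype() (not the class itself); the empty field is then read as <NA> and the column stays Int32. A minimal sketch, reusing the path and column name from above:

import pandas as pd

# "Int32" (capital I) is the string alias for pd.Int32Dtype().
df = pd.read_csv(r"E:\work\nullable.csv",
                 names=["nullable"],
                 dtype={"nullable": "Int32"},
                 encoding="utf-8")

print(df.dtypes)  # expected: nullable    Int32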

However, if I do

import numpy as np

s = pd.Series([1, 2.0, np.nan, 4.0])

s2 = s.astype('Int32')

The dtype is correctly reported as Int32:

s2.info()
AttributeError("'Series' object has no attribute 'info'")
s2
0      1
1      2
2    NaN
3      4
dtype: Int32

This looks like a bug to me.

Are there any suggestions on how to work around this? I want to save the CSV as Parquet, but if I use pd.Int32Dtype the column is saved as a string.

It's not feasible to remove or replace NaNs.
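
A minimal sketch of one possible workaround for the Parquet goal: read the column without forcing a dtype, convert it to the nullable Int32 type afterwards, and only then write the file. This assumes a Parquet engine such as pyarrow is installed; the output path is a placeholder.

import pandas as pd

# Read first: empty fields become NaN in a float64 column.
df = pd.read_csv(r"E:\work\nullable.csv",
                 names=["nullable"],
                 encoding="utf-8")

# Convert to the nullable extension type; NaN becomes <NA>.
df["nullable"] = df["nullable"].astype("Int32")

# pyarrow stores the column as a nullable int32 rather than a string.
df.to_parquet(r"E:\work\nullable.parquet")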

Pandas read_csv interprets 'NaN' as null, but not 'NAN'. You can pass 'NAN' to the na_values argument:

df = pd.read_csv(r"E:\work\nullable.csv",
                 names=names,
                 dtype=dtype,
                 encoding="utf-8",
                 na_values="NAN")
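
For reference, the set of strings read_csv treats as missing by default can be inspected; 'NaN' is in it, but 'NAN' is not. A quick check, using an internal pandas constant whose location may change between versions:

from pandas._libs.parsers import STR_NA_VALUES  # internal; may move in future pandas versions

print('NaN' in STR_NA_VALUES)  # True
print('NAN' in STR_NA_VALUES)  # False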
