
Prevent Pandas from coercing ints to floats when creating a dataframe

I have 11 lists: four contain ints and the remaining seven contain floats. I create a dataframe from all 11 lists using:

df = pd.DataFrame({  col_headers[0]  : pd.Series(upper_time,   dtype='float'), 
                     col_headers[1]  : pd.Series(upper_pres,   dtype='float'),
                     col_headers[2]  : pd.Series(upper_indx,   dtype='int'),
                     col_headers[3]  : pd.Series(upper_pulses, dtype='int'), 
                     col_headers[4]  : pd.Series(median_upper_pulses, dtype='float'),
                     col_headers[5]  : pd.Series(lower_time,   dtype='float'),
                     col_headers[6]  : pd.Series(lower_pres,   dtype='float'), 
                     col_headers[7]  : pd.Series(lower_indx,   dtype='int'),
                     col_headers[8]  : pd.Series(lower_pulses, dtype='int'), 
                     col_headers[9]  : pd.Series(median_lower_pulses, dtype='float'),
                     col_headers[10] : pd.Series(median_both_pulses,  dtype='float')
                        })

Unfortunately, when I type df.dtypes, I get:

df.dtypes
Upper Systole Time              float64
Upper Systole Pressure          float64
Upper Systole Index               int32
Upper Systole Pulses              int32
Median Upper Systolic Pulses    float64
Lower Systole Time              float64
Lower Systole Pressure          float64
Lower Systole Index             float64
Lower Systole Pulses            float64
Median Lower Systolic Pulses    float64
Median Both Systolic Pulses     float64
dtype: object

Upper Systole Index, Lower Systole Index, Upper Systole Pulses and Lower Systole Pulses should all be ints (and they are if I check the type of every element in the relevant lists). But somehow, when I create a dataframe, two of the four ints get coerced to floats in spite of my explicit direction to keep them as ints.

I suspect that this has something to do with the fact that lists 0-4 have one length, and lists 5-10 have a different length, but lots of Googling and searching through StackOverflow has not thrown up an answer.

How can I ensure that my ints remain ints?

If you do the following:

import pandas as pd

pd.DataFrame({"A": pd.Series([1, 2, 3, 4], dtype='int'),
              "B": pd.Series([1, 3], dtype='int')}).astype(int)

You will get the following error:

    867         if not np.isfinite(arr).all():
--> 868             raise ValueError("Cannot convert non-finite values (NA or inf) to integer")
    869 
    870     elif is_object_dtype(arr):

ValueError: Cannot convert non-finite values (NA or inf) to integer

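This indicates that the problem is the presence of NaNs. When the columns are aligned on a common index, the shorter Series is padded with NaN, and because NaN is a float that a NumPy integer column cannot hold, pandas upcasts that column to float64.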
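A minimal sketch of that upcast, using made-up data rather than the lists from the question. The names a, b and df_demo are illustrative only, and 'int64' is spelled out so the result does not depend on the platform's default integer width:

```python
import pandas as pd

# Two integer Series of different lengths (illustrative data only).
a = pd.Series([1, 2, 3, 4], dtype="int64")
b = pd.Series([1, 3], dtype="int64")

# Building the frame aligns both Series on the union of their indexes (0..3).
df_demo = pd.DataFrame({"A": a, "B": b})

# "B" has no values at index 2 and 3, so those slots are filled with NaN.
# NaN cannot be stored in a NumPy integer array, so "B" becomes float64,
# while the full-length "A" keeps its integer dtype.
print(df_demo.dtypes)
print(df_demo["B"].isna().sum())  # number of padded rows in "B"
```

This matches the pattern in the question: only the columns built from the shorter lists lose their integer dtype.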

If you replace the NaN values with an integer, 0 for example, you can then cast the affected columns to integers with .astype(int). Note that the filled-in zeros are then indistinguishable from genuine zeros in the data.

Example:

df = pd.DataFrame({"A":pd.Series([1,2,3,4], dtype='int'),
             "B": pd.Series([1,3], dtype='int')})

df["B"] = df["B"].fillna(0).astype(int)

filippo, thank you very much - dtype='Int64' with a capital 'I' did the trick. I was unaware of this, and it is nicely written up at https://pandas.pydata.org/pandas-docs/stable/user_guide/missing_data.html , where it is stated that pd.Int64Dtype() is aliased to 'Int64'.

Thanks again

Thomas Philips
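For later readers, here is a short sketch of the nullable-integer approach described in the comment above. The data and the names a_nullable, b_nullable and df_nullable are made up for illustration. It assumes a pandas version that supports the 'Int64' extension dtype:

```python
import pandas as pd

# Same shape of problem: two integer Series of different lengths,
# but created with the nullable extension dtype "Int64" (capital I).
a_nullable = pd.Series([1, 2, 3, 4], dtype="Int64")
b_nullable = pd.Series([1, 3], dtype="Int64")

df_nullable = pd.DataFrame({"A": a_nullable, "B": b_nullable})

# The missing rows in "B" are stored as <NA> instead of NaN,
# so the column keeps an integer dtype rather than being upcast to float64.
print(df_nullable.dtypes)
print(df_nullable)
```

Unlike fillna(0), this keeps missing entries marked as missing, so they are not confused with real zeros.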
