Convert pandas column (containing floats and NaN values) from float64 to nullable int8

I have a large dataframe that looks somewhat like this:

    a   b   c
0   2.2 6.0 0.0
1   3.3 7.0 NaN
2   4.4 NaN 3.0
3   5.5 9.0 NaN

Columns b and c contain float values that are either positive natural numbers or NaN. However, they are stored as float64, which is a problem: without going into further detail, this dataframe is the input of a pipeline that requires these columns to be integers, so I want to store them as such. The output should look like this:

    a   b   c
0   2.2 6   0
1   3.3 7   NaN
2   4.4 NaN 3
3   5.5 9   NaN

I read in the pandas documentation that nullable integers are only supported by the pandas extension dtypes such as "Int8" (note: this is different from np.int8), so naturally, I attempted this:

df = df.astype({'b':pd.Int8Dtype(), 'c':pd.Int8Dtype()})

This works when I run it in my Jupyter notebook, but when I integrate it into a larger function, I get this error:

TypeError: cannot safely cast non-equivalent float64 to int8

I think I understand why I get the error: x == int(x) cannot be True for NaN values, so the program treats this conversion as unsafe, even though all values are either NaN or natural numbers. So next, I tried:

df = df.astype({'b':pd.Int8Dtype(), 'c':pd.Int8Dtype()}, errors='ignore')

I figured that this would get rid of the 'unsafe conversion' problem, since I am 100% sure all non-NaN float64 values are natural numbers. However, when I use this line, all of my numbers are still stored as floats! Infuriating!

Does anyone have a workaround for this?
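
A note on the error above: as far as I can tell, pandas masks NaN before attempting the cast, so this TypeError usually means that some non-NaN value has a fractional part or falls outside the int8 range (-128 to 127). That would also be consistent with the line working in a notebook but failing on other data. The sketch below is one way to check that premise before casting. The name `values_unsafe_for_int8` is a hypothetical helper, and the two sample Series are made up for illustration.

```python
import numpy as np
import pandas as pd


def values_unsafe_for_int8(s: pd.Series) -> pd.Series:
    """Hypothetical helper: return the non-NaN values that int8 cannot hold exactly."""
    non_null = s.dropna()
    limits = np.iinfo(np.int8)  # -128 .. 127
    # A value with a fractional part differs from its floor.
    has_fraction = non_null != np.floor(non_null)
    out_of_range = (non_null < limits.min) | (non_null > limits.max)
    return non_null[has_fraction | out_of_range]


# Illustrative data: 'clean' mirrors column b of the question, 'dirty' is invented.
clean = pd.Series([6.0, 7.0, np.nan, 9.0])
dirty = pd.Series([6.0, 7.5, np.nan, 300.0])

print(values_unsafe_for_int8(clean).empty)  # no offending values
print(values_unsafe_for_int8(dirty))        # the rows holding 7.5 and 300.0
```

If the helper returns any rows for the real data, those rows are the likely cause of the failed cast rather than the NaN entries.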

I ran into exactly the same issue, which led me to this page. I do not have a genuinely good solution and am looking for one myself, but I did find a workaround.

Before going into that, I would like to respond to the comment posted on the original question: being able to assign NA or even None to a series of such 'simple' types as int8 is the whole point of making these dtype conversions. The usual operations such as isna() work on series of these dtypes (see pd.IntXDtype(), where 'X' stands for the number of bits). The advantage I am after with these dtypes is the memory footprint, e.g.:

In[56]: test_df = pd.Series(np.zeros(1_000_000), dtype=np.float64)

In[57]: test_df.memory_usage()
Out[57]: 8000128

In[58]: test_df = pd.Series(np.zeros(1_000_000), dtype=pd.Int8Dtype())

In[59]: test_df.memory_usage()
Out[59]: 2000128

In[60]: test_df.iloc[:500_000] = None

In[61]: test_df.memory_usage()
Out[61]: 2000128

In[62]: test_df.isna().sum()
Out[62]: 500000

So you get the best of both worlds.
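
The same comparison can be written as a standalone script. This is only a sketch: the size `n` is arbitrary, and `index=False` is used so that the index does not enter the numbers. My understanding is that the nullable dtype stores one data byte plus one mask byte per value, which matches the roughly 2 MB shown above, but the code only checks that it is smaller than float64.

```python
import numpy as np
import pandas as pd

n = 100_000  # arbitrary size chosen for this sketch

as_float = pd.Series(np.zeros(n), dtype=np.float64)
as_nullable = as_float.astype(pd.Int8Dtype())

# Bytes used by the values only (index excluded).
float_bytes = as_float.memory_usage(index=False)
nullable_bytes = as_nullable.memory_usage(index=False)
print(float_bytes, nullable_bytes)

# Marking half of the values as missing should not change the footprint,
# because the mask is allocated whether or not any value is missing.
as_nullable.iloc[: n // 2] = pd.NA
bytes_after_na = as_nullable.memory_usage(index=False)
print(bytes_after_na == nullable_bytes)
print(as_nullable.isna().sum())
```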

Now the workaround:

In[33]: my_df
Out[33]: 
     a    s      d
0    0 -500 -1.000
1    1 -499 -0.998
2    2 -498 -0.996
3    3 -497 -0.994
4    4 -496 -0.992

In[34]: my_df.dtypes
Out[34]: 
a      int64
s      int64
d    float64
dtype: object

In[35]: df_converted_to_int_first = my_df.astype(
   ...:     dtype={
   ...:         'a': np.int8,
   ...:         's': np.int16,
   ...:         'd': np.float16,
   ...:     },
   ...: )

In[36]: df_converted_to_int_first
Out[36]: 
     a    s         d
0    0 -500 -1.000000
1    1 -499 -0.998047
2    2 -498 -0.996094
3    3 -497 -0.994141
4    4 -496 -0.992188

In[37]: df_converted_to_int_first.dtypes
Out[37]: 
a       int8
s      int16
d    float16
dtype: object

In[38]: df_converted_to_special_int_after = df_converted_to_int_first.astype(
   ...:     dtype={
   ...:         'a': pd.Int8Dtype(),
   ...:         's': pd.Int16Dtype(),
   ...:     }
   ...: )

In[39]: df_converted_to_special_int_after.dtypes
Out[39]: 
a       Int8
s      Int16
d    float16
dtype: object

In[40]: df_converted_to_special_int_after.a.iloc[3] = None

In[41]: df_converted_to_special_int_after
Out[41]: 
       a     s         d
0      0  -500 -1.000000
1      1  -499 -0.998047
2      2  -498 -0.996094
3   <NA>  -497 -0.994141
4      4  -496 -0.992188

This is still not an acceptable solution in my opinion, but as mentioned above, it is a workaround, which is what the original question asked for.

EDIT: A test that was missing, from np.float64 to pd.Int8Dtype():

In[67]: my_df.astype(
   ...:     dtype={
   ...:         'a': np.int8,
   ...:         's': np.int16,
   ...:         'd': np.int16,
   ...:     },
   ...: ).astype(    
   ...:     dtype={
   ...:         'a': np.int8,
   ...:         's': np.int16,
   ...:         'd': pd.Int8Dtype(),
   ...:     },
   ...: ).dtypes

Out[67]: 
a     int8
s    int16
d     Int8
dtype: object
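
For the dataframe in the question, where the float columns also contain NaN, the intermediate cast to a NumPy integer type is not available, since NumPy integers cannot hold NaN. A minimal sketch of a direct cast follows. It assumes that any non-integral values are only floating-point noise, so rounding them first is acceptable; if that assumption does not hold, rounding silently changes the data. Values outside the int8 range would still raise. The dataframe is rebuilt from the question's example, and `int_cols` and `converted` are names introduced here for illustration.

```python
import numpy as np
import pandas as pd

# Example data copied from the question.
df = pd.DataFrame({
    "a": [2.2, 3.3, 4.4, 5.5],
    "b": [6.0, 7.0, np.nan, 9.0],
    "c": [0.0, np.nan, 3.0, np.nan],
})

int_cols = ["b", "c"]  # columns assumed to hold whole numbers or NaN

# Round (assumption: only float noise is removed), then cast to the
# nullable dtype; NaN becomes <NA> instead of blocking the conversion.
converted = df.assign(
    **{col: df[col].round().astype("Int8") for col in int_cols}
)

print(converted)
print(converted.dtypes)
```

Column a is left as float64, and the missing entries in b and c are shown as <NA>.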
