Convert pandas column (containing floats and NaN values) from float64 to nullable int8
I have a large dataframe that looks somewhat like this:
a b c
0 2.2 6.0 0.0
1 3.3 7.0 NaN
2 4.4 NaN 3.0
3 5.5 9.0 NaN
Columns b and c contain float values that are either natural numbers (0 included) or NaN. However, they are stored as float64, which is a problem, since (without going into further detail) this dataframe is the input of a pipeline that requires these to be integers, so I want to store them as such. The output should look like this:
a b c
0 2.2 6 0
1 3.3 7 NaN
2 4.4 NaN 3
3 5.5 9 NaN
I read in the pandas documentation that nullable integers are only supported through pandas' own extension dtypes such as "Int8" (note: this is different from np.int8), so naturally, I attempted this:
df = df.astype({'b':pd.Int8Dtype(), 'c':pd.Int8Dtype()})
This works when I run it in my Jupyter notebook, but when I integrate it within a larger function, I get this error:
TypeError: cannot safely cast non-equivalent float64 to int8
I understand why I get the error, since x == int(x) will be False for NaN values, so the program thinks this conversion is unsafe, even though all values are either NaN or natural numbers. So next, I tried:
df = df.astype({'b':pd.Int8Dtype(), 'c':pd.Int8Dtype()}, errors='ignore')
I figured that this would get rid of the 'unsafe conversion' problem, since I am 100% sure all float64 values are natural numbers. However, when I use this line, all of my numbers are still stored as floats! Infuriating!
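For context on why the floats survive: as far as I know, errors='ignore' makes astype suppress the exception and hand back the original data unchanged, so a failed cast goes silent rather than being fixed. The sketch below is a minimal illustration under that assumption. The series `s` and its values are made up for the example, and the behaviour may differ between pandas versions.

```python
import numpy as np
import pandas as pd

# Made-up data: 2.5 has a fractional part, so it cannot be cast to Int8 safely.
s = pd.Series([1.0, 2.5, np.nan])

# With errors="ignore" the failed cast is suppressed and the input comes back as-is.
ignored = s.astype("Int8", errors="ignore")
print(ignored.dtype)  # still float64

# With the default errors="raise" the same cast fails loudly instead.
try:
    s.astype("Int8")
    raised = False
except TypeError as exc:
    raised = True
    print(f"cast refused: {exc}")
```

If the dtype is unchanged after an astype call with errors='ignore', that is a hint that some value in the column could not be converted, not that the conversion was skipped by design.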
Does anyone have a workaround for this?
I ran into exactly the same issue, which led me to this page. I do not have a genuinely good solution for this issue and am looking for one myself... but I did find a workaround.
Before going into that, I would like to respond to the comment posted on the original question: being able to assign NA or even None values to series of such 'simple' types as int8 is the whole point of making these dtype conversions. It is possible to perform the typical operations such as isna() (and so on) on series of these dtypes (see pd.IntXDtype(), where 'X' stands for the number of bits). The advantage I am after by using these dtypes is the memory footprint, e.g.:
In[56]: test_df = pd.Series(np.zeros(1_000_000), dtype=np.float64)
In[57]: test_df.memory_usage()
Out[57]: 8000128
In[58]: test_df = pd.Series(np.zeros(1_000_000), dtype=pd.Int8Dtype())
In[59]: test_df.memory_usage()
Out[59]: 2000128
In[60]: test_df.iloc[:500_000] = None
In[61]: test_df.memory_usage()
Out[61]: 2000128
In[62]: test_df.isna().sum()
Out[62]: 500000
So you get the best of both worlds.所以你得到了两全其美。
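The 2 bytes per element seen above fit how I understand the nullable integer arrays to be laid out: one int8 value plus one boolean mask entry per element. Here is a small sketch to check this, using a made-up series of 1,000 zeros and index=False so that the index is left out of the count:

```python
import numpy as np
import pandas as pd

n = 1_000
as_float = pd.Series(np.zeros(n), dtype=np.float64)
as_nullable = as_float.astype("Int8")

# Bytes used by the values only (the index is excluded).
float_bytes = as_float.memory_usage(index=False)
nullable_bytes = as_nullable.memory_usage(index=False)
print(float_bytes, nullable_bytes)

# Marking half of the entries as missing only flips mask bits,
# so the footprint should stay the same.
as_nullable.iloc[: n // 2] = pd.NA
after_na_bytes = as_nullable.memory_usage(index=False)
print(after_na_bytes, as_nullable.isna().sum())
```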
Now the workaround:
In[33]: my_df
Out[33]:
a s d
0 0 -500 -1.000
1 1 -499 -0.998
2 2 -498 -0.996
3 3 -497 -0.994
4 4 -496 -0.992
In[34]: my_df.dtypes
Out[34]:
a int64
s int64
d float64
dtype: object
In[35]: df_converted_to_int_first = my_df.astype(
...: dtype={
...: 'a': np.int8,
...: 's': np.int16,
...: 'd': np.float16,
...: },
...: )
In[36]: df_converted_to_int_first
Out[36]:
a s d
0 0 -500 -1.000000
1 1 -499 -0.998047
2 2 -498 -0.996094
3 3 -497 -0.994141
4 4 -496 -0.992188
In[37]: df_converted_to_int_first.dtypes
Out[37]:
a int8
s int16
d float16
dtype: object
In[38]: df_converted_to_special_int_after = df_converted_to_int_first.astype(
...: dtype={
...: 'a': pd.Int8Dtype(),
...: 's': pd.Int16Dtype(),
...: }
...: )
In[39]: df_converted_to_special_int_after.dtypes
Out[39]:
a Int8
s Int16
d float16
dtype: object
In[40]: df_converted_to_special_int_after.a.iloc[3] = None
In[41]: df_converted_to_special_int_after
Out[41]:
a s d
0 0 -500 -1.000000
1 1 -499 -0.998047
2 2 -498 -0.996094
3 <NA> -497 -0.994141
4 4 -496 -0.992188
This is still not an acceptable solution in my opinion... but as mentioned above, it constitutes a workaround, which is what the original question asked for.
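A side note on In[40]: assigning through .a.iloc[3] is chained indexing, which pandas may warn about or, with copy-on-write enabled, not write back to the frame at all. Below is a sketch of the same step with a single .loc call, on a small made-up frame that already uses the nullable dtypes:

```python
import pandas as pd

# Made-up frame mirroring columns 'a' and 's' from the session above.
df = pd.DataFrame(
    {
        "a": pd.array([0, 1, 2, 3, 4], dtype="Int8"),
        "s": pd.array([-500, -499, -498, -497, -496], dtype="Int16"),
    }
)

# One indexing operation on the frame itself, so the write is unambiguous.
df.loc[3, "a"] = pd.NA

print(df)
print(df.dtypes)
```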
EDIT: A test that was missing, from np.float64 to pd.Int8Dtype():
In[67]: my_df.astype(
...: dtype={
...: 'a': np.int8,
...: 's': np.int16,
...: 'd': np.int16,
...: },
...: ).astype(
...: dtype={
...: 'a': np.int8,
...: 's': np.int16,
...: 'd': pd.Int8Dtype(),
...: },
...: ).dtypes
Out[67]:
a int8
s int16
d Int8
dtype: object
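Note that in the test above column d passes through np.int16 first, which truncates values such as -0.998 to 0 before the nullable dtype is applied. As a sketch of a more direct route, assuming the float columns really hold only whole numbers and NaN as in the question, astype can go straight from float64 to "Int8", and a small check can list the values that would block it. The helper name unsafe_for_int8 is hypothetical, not a pandas function, and the sample data is made up.

```python
import numpy as np
import pandas as pd


def unsafe_for_int8(series: pd.Series) -> pd.Series:
    """Return the non-missing values that int8 cannot represent exactly.

    Hypothetical helper for illustration; not part of pandas.
    """
    values = series.dropna()
    limits = np.iinfo(np.int8)
    fractional = values != np.floor(values)
    out_of_range = (values < limits.min) | (values > limits.max)
    return values[fractional | out_of_range]


# The frame from the question: whole numbers and NaN stored as float64.
df = pd.DataFrame(
    {
        "a": [2.2, 3.3, 4.4, 5.5],
        "b": [6.0, 7.0, np.nan, 9.0],
        "c": [0.0, np.nan, 3.0, np.nan],
    }
)

# NaN on its own does not block the cast; it becomes <NA>.
converted = df.astype({"b": "Int8", "c": "Int8"})
print(converted)
print(converted.dtypes)

# Fractional or out-of-range values are what the check reports.
suspect = pd.Series([1.0, 2.5, 300.0, np.nan])
print(unsafe_for_int8(suspect))
```

If the check returns an empty series for a column and the cast still fails inside a larger function, the data reaching that function is probably not the same as in the notebook, which would be worth verifying first.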