How to enforce categorical integer dtype when writing to a csv file with pandas.DataFrame.to_csv
(The question has been updated to isolate the problem more strictly.)
I have data in a pandas.DataFrame with categorical variables. The categories are integers. The data may have missing values.
import pandas as pd
# Define all dtypes
dtypes = {
    'var_001': pd.api.types.CategoricalDtype(
        categories=[1, 2, 3, 4],
        ordered=False,
    ),
    'var_002': pd.UInt8Dtype(),
    'var_003': pd.api.types.CategoricalDtype(
        categories=[1, 2, 3, 4, 5],
        ordered=True,
    ),
}
# Create a dataframe
df = pd.DataFrame(
    data={
        'var_001': [1, '', 3],
        'var_002': [43, 62, 99],
        'var_003': [2, 3, 3],
    },
)
# Convert to the right dtypes (btw, why can't this be done in the constructor?)
df = df.astype(dtype=dtypes)
The dtypes look right:
>>> print(df.dtypes)
var_001    category
var_002       UInt8
var_003    category
dtype: object
As does the data in the dataframe:
>>> print(df)
  var_001  var_002 var_003
0       1       43       2
1     NaN       62       3
2       3       99       3
However, when I write the dataframe to a csv file (df.to_csv('data.csv', index=False)), the values of the variable with missing values are written as floats instead of integers:
var_001, var_002, var_003
1.0, 43, 2
, 62, 3
3.0, 99, 3
Is there a way to keep the integer categories when writing to a csv file, even for data with missing values?
Apparently, the problem lies with the non-nullable integers:
In Working with missing data, we saw that pandas primarily uses NaN to represent missing data. Because NaN is a float, this forces an array of integers with any missing values to become floating point.
https://pandas.pydata.org/pandas-docs/stable/user_guide/integer_na.html
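The upcast described in that passage is easy to see on a small Series. A minimal sketch (the names plain and nullable are only for illustration and do not appear in the question):

```python
import pandas as pd

# Default integer data with a missing value is upcast to float64,
# because the missing value is stored as NaN (a float).
plain = pd.Series([1, None, 3])

# The nullable extension dtype keeps the integers and stores pd.NA instead.
nullable = pd.Series([1, None, 3], dtype="Int64")

print(plain.dtype)     # float64
print(nullable.dtype)  # Int64
print(plain.tolist())  # [1.0, nan, 3.0]
```

Note the capital "I" in "Int64": the lowercase "int64" is the ordinary NumPy dtype, which cannot hold a missing value.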
So, to allow missing values in categorical variables as well, we must define the categories as nullable integers:
import pandas as pd
# Create an array with nullable integer values
cat_0_4 = pd.array([0, 1, 2, 3, 4], dtype="Int8")
# Define an ordered categorical dtype with nullable integer values
var_dtype = pd.api.types.CategoricalDtype(
    categories=cat_0_4,
    ordered=True,
)
...
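As a separate fallback, and not the approach described above, the column can also be cast to a nullable integer dtype only at write time. The sketch below assumes every category is an integer. The names out, buffer and csv_text are illustrative, and an in-memory buffer stands in for data.csv.

```python
import io
import pandas as pd

# Same shape as the question's data: an integer-valued categorical
# column with one missing value, next to a nullable UInt8 column.
df = pd.DataFrame({
    "var_001": pd.Categorical([1, None, 3], categories=[1, 2, 3, 4]),
    "var_002": pd.array([43, 62, 99], dtype="UInt8"),
})

# Work on a copy so the categorical dtype is kept in the original frame.
out = df.copy()
# category -> float64 -> nullable Int64 (the missing value becomes pd.NA)
out["var_001"] = out["var_001"].astype("float64").astype("Int64")

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
out.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```

The trade-off is that the category information (allowed values, ordering) is not part of the written column, so it has to be re-applied with astype after reading the file back.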