Column dtype with pandas read_json

Question

I have a JSON file that looks like this:

[{"A": 0, "B": "x"}, {"A": 1, "B": "y", "C": 0}, {"A": 2, "B": "z", "C": 1}]

Because the first record has no "C" key, column "C" contains a NaN value (first row), so pandas automatically infers its dtype as "float64":

>>> pd.read_json(path).C.dtype
dtype('float64')

However, I want column "C" to have dtype "Int32". pd.read_json(path, dtype={"C": "Int32"}) does not work:

>>> pd.read_json(path, dtype={"C": "Int32"}).C.dtype
dtype('float64')

In contrast, pd.read_json(path).astype({"C": "Int32"}) does work:

>>> pd.read_json(path).astype({"C": "Int32"}).C.dtype
Int32Dtype()

Why does this happen? How can I set the correct dtype using only the pd.read_json function?

Answer 1

The reason lies in this section of the pandas source code:

        dtype = (
            self.dtype.get(name) if isinstance(self.dtype, dict) else self.dtype
        )
        if dtype is not None:
            try:
                dtype = np.dtype(dtype)
                return data.astype(dtype), True
            except (TypeError, ValueError):
                return data, False

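It converts 'Int32' to numpy.int32, and the attempt to cast the whole column (array) to that type then raises a ValueError (cannot convert non-finite values (NA or inf) to integer). The original, unconverted data is therefore returned from the except block.
I guess this is some kind of bug in pandas; at the very least, the behavior is not properly documented.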
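
To make the silent fallback concrete, here is a minimal sketch of the same try/except pattern. The helper try_convert and the sample column are hypothetical names made up for illustration, not part of pandas; the sketch only assumes that casting a float column containing NaN to a NumPy integer type raises a ValueError (or a subclass of it).

```python
import numpy as np
import pandas as pd

# A float column with a missing value, like column "C" after read_json.
col = pd.Series([np.nan, 0.0, 1.0], name="C")


def try_convert(data, dtype):
    """Hypothetical helper mimicking the quoted pattern (simplified)."""
    try:
        return data.astype(dtype), True
    except (TypeError, ValueError):
        # The cast failed, so the caller gets the unconverted data back
        # and no error is shown to the user.
        return data, False


result, converted = try_convert(col, np.int32)
print(converted, result.dtype)
```

Since the exception is swallowed, the only visible symptom is that the column keeps its float64 dtype.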

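astype, on the other hand, works differently: it resolves 'Int32' to the pandas nullable extension dtype instead of going through np.dtype, so the column can hold integers alongside a missing value (pd.NA).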

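Interestingly, when you specify the extension type pd.Int32Dtype() directly (instead of its string alias 'Int32'), at first glance you seem to get the desired result, but a closer look at the element types shows they are still floats: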

df = pd.read_json(json, dtype={"C": pd.Int32Dtype})
print(df)
#   A  B    C
#0  0  x  NaN
#1  1  y    0
#2  2  z    1
print(df.C.map(type))
#0    <class 'float'>
#1    <class 'float'>
#2    <class 'float'>
#Name: C, dtype: object

For comparison:

print(df.C.astype('Int32').map(type))
#0    <class 'pandas._libs.missing.NAType'>
#1                            <class 'int'>
#2                            <class 'int'>
#Name: C, dtype: object
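
As a practical workaround, the conversion from the question can be chained directly onto the read. The sketch below is only an illustration under a few assumptions: it feeds the question's JSON through io.StringIO instead of a file path, and the variable names raw and df are chosen here for the example.

```python
from io import StringIO

import pandas as pd

# The JSON from the question, inlined so the example is self-contained.
raw = '[{"A": 0, "B": "x"}, {"A": 1, "B": "y", "C": 0}, {"A": 2, "B": "z", "C": 1}]'

# Read first (column "C" comes back as float64), then cast to the nullable
# integer dtype in the same expression.
df = pd.read_json(StringIO(raw)).astype({"C": "Int32"})

print(df.dtypes)
print(df["C"].isna().sum())  # number of missing values kept in "C"
```

This does not make read_json itself honor the 'Int32' alias; it simply applies astype immediately after loading, so the missing entry is stored as pd.NA rather than forcing the column to float.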
