What is causing these Int64 columns to raise a TypeError?

Question

I have a pandas DataFrame with several flag/dummy variables of dtype Int64.

I am grouping by other fields and taking the mean to compute percentages.

df.groupby(["key1", "key2"]).mean()

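When I try to take the mean, I get TypeError: cannot safely cast non-equivalent float64 to int64.

When I compute the mean of each column one at a time, I do not get an error.

I am trying to understand what might be causing the error. Any insight would be appreciated.

Here is a description of the data: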

In:

df.info()

Out:

<class 'pandas.core.frame.DataFrame'>
Int64Index: 6910491 entries, 82222 to 6858085
Data columns (total 5 columns):
 #   Column     Dtype 
---  ------     ----- 
 0   key1       object
 1   key2       object
 2   cond1      int64 
 3   cond2      Int64 
 4   cond1and2  Int64 
dtypes: Int64(2), int64(1), object(2)
memory usage: 329.5+ MB

In:

df.describe()

Out:


              cond1         cond2     cond1and2
count  6.910491e+06  6.910491e+06  6.910491e+06
mean   2.004735e-02  1.050030e-01  6.695038e-03
std    1.401622e-01  3.065573e-01  8.154885e-02
min    0.000000e+00  0.000000e+00  0.000000e+00
25%    0.000000e+00  0.000000e+00  0.000000e+00
50%    0.000000e+00  0.000000e+00  0.000000e+00
75%    0.000000e+00  0.000000e+00  0.000000e+00
max    1.000000e+00  1.000000e+00  1.000000e+00

In: 

[print(df[c].value_counts(), "\n\n") for c in df]

Out:

c    2220221
d    2208322
b    2195117
a     286831
Name: key1, dtype: int64 


1    1925173
4    1680848
3    1656101
2    1648369
Name: key2, dtype: int64 


0    6771954
1     138537
Name: cond1, dtype: int64 


0    6184869
1     725622
Name: cond2, dtype: Int64 


0    6864225
1      46266
Name: cond1and2, dtype: Int64 


[None, None, None, None, None]

In: 

df.groupby(['key1', 'key2']).mean()

Out:

TypeError                                 Traceback (most recent call last)
C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\arrays\integer.py in safe_cast(values, dtype, copy)
    143     try:
--> 144         return values.astype(dtype, casting="safe", copy=copy)
    145     except TypeError:

TypeError: Cannot cast array from dtype('float64') to dtype('int64') according to the rule 'safe'

During handling of the above exception, another exception occurred:

TypeError                                 Traceback (most recent call last)
<ipython-input-70-5cec730bfc37> in <module>
----> 1 df.groupby(['key1', 'key2']).mean()

C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\groupby\groupby.py in mean(self, *args, **kwargs)
   1230         nv.validate_groupby_func("mean", args, kwargs, ["numeric_only"])
   1231         return self._cython_agg_general(
-> 1232             "mean", alt=lambda x, axis: Series(x).mean(**kwargs), **kwargs
   1233         )
   1234 

C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\groupby\generic.py in _cython_agg_general(self, how, alt, numeric_only, min_count)
   1002     ) -> DataFrame:
   1003         agg_blocks, agg_items = self._cython_agg_blocks(
-> 1004             how, alt=alt, numeric_only=numeric_only, min_count=min_count
   1005         )
   1006         return self._wrap_agged_blocks(agg_blocks, items=agg_items)

C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\groupby\generic.py in _cython_agg_blocks(self, how, alt, numeric_only, min_count)
   1091                         # Cast back if feasible
   1092                         result = type(block.values)._from_sequence(
-> 1093                             result.ravel(), dtype=block.values.dtype
   1094                         )
   1095                     except ValueError:

C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\arrays\integer.py in _from_sequence(cls, scalars, dtype, copy)
    348     @classmethod
    349     def _from_sequence(cls, scalars, dtype=None, copy=False):
--> 350         return integer_array(scalars, dtype=dtype, copy=copy)
    351 
    352     @classmethod

C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\arrays\integer.py in integer_array(values, dtype, copy)
    129     TypeError if incompatible types
    130     """
--> 131     values, mask = coerce_to_array(values, dtype=dtype, copy=copy)
    132     return IntegerArray(values, mask)
    133 

C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\arrays\integer.py in coerce_to_array(values, dtype, mask, copy)
    245         values = safe_cast(values, dtype, copy=False)
    246     else:
--> 247         values = safe_cast(values, dtype, copy=False)
    248 
    249     return values, mask

C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\arrays\integer.py in safe_cast(values, dtype, copy)
    150 
    151         raise TypeError(
--> 152             f"cannot safely cast non-equivalent {values.dtype} to {np.dtype(dtype)}"
    153         )
    154 

TypeError: cannot safely cast non-equivalent float64 to int64

Answer 1

Int64 (the pandas nullable integer array dtype) is not the same as int64 (the plain NumPy integer dtype).
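To make the distinction concrete, here is a minimal sketch on made-up values (the variable names and sample data are assumptions, not from the question). It also illustrates the step the traceback labels "Cast back if feasible": whole-number floats can be turned into an Int64 array, but fractional values, such as the mean of 0/1 flags, cannot.

```python
import pandas as pd

# Plain NumPy-backed integers: no way to represent a missing value.
plain = pd.Series([0, 1, 1], dtype="int64")

# pandas nullable integers: missing entries are stored as pd.NA.
nullable = pd.Series([0, 1, None], dtype="Int64")

print(plain.dtype, nullable.dtype)  # the dtype names differ only by case
print(nullable.isna().sum())        # number of missing entries

# Whole-number floats can be converted to Int64 ...
ok = pd.array([0.0, 1.0], dtype="Int64")

# ... but fractional values cannot be cast safely.
try:
    pd.array([0.25, 0.5], dtype="Int64")
    cast_failed = False
except TypeError as exc:
    cast_failed = True
    print(exc)
```

Whether groupby().mean() attempts this cast back depends on the pandas version; the traceback in the question shows that the asker's version did.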

To fix this, change the dtype of those columns:

df[['cond2', 'cond1and2']] = df[['cond2', 'cond1and2']].astype('int64')

or

import numpy as np

df[['cond2', 'cond1and2']] = df[['cond2', 'cond1and2']].astype(np.int64)

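Note: the cast to int64 cannot succeed while a column still contains missing values. If there are any (df.describe() may help detect them), there are several ways to handle them, for example dropping the rows that contain missing values or filling in the missing cells.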

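Missing values are frequently indicated by out-of-range entries: perhaps a negative number (e.g. -1) in a numeric field that is normally only positive, or a 0 in a numeric field that can normally never be 0. (Witten, I. H. (2016). Data Mining: Practical Machine Learning Tools and Techniques)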
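The two options mentioned in the note can be sketched as follows. The sample data is hypothetical; only the column names come from the question. Treating a missing flag as 0 is an assumption that changes the resulting percentages, so it should be a deliberate choice.

```python
import pandas as pd

# Hypothetical sample data; the column names follow the question.
df = pd.DataFrame({
    "cond2": pd.array([0, 1, None, 1], dtype="Int64"),
    "cond1and2": pd.array([0, 0, 1, None], dtype="Int64"),
})
cols = ["cond2", "cond1and2"]

# Count the missing entries per column before attempting the cast.
missing = df[cols].isna().sum()
print(missing)

# Option A: drop every row in which any flag is missing, then cast.
dropped = df.dropna(subset=cols).astype({c: "int64" for c in cols})

# Option B: treat a missing flag as 0, then cast.
filled = df[cols].fillna(0).astype("int64")

print(dropped.dtypes)
print(filled.dtypes)
```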


Answer 2

You may have NaN values in your initial DataFrame (these are treated as floats), hence the error message.

Try this:

df = df.fillna(0)  # replace NaN values with 0
df.groupby(["key1", "key2"]).mean()
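One caveat worth checking: fillna keeps the Int64 dtype, so if the error persists after filling, an alternative (not part of this answer) is to cast the flag columns to float64 before aggregating. A minimal sketch on made-up data, where the sample values and the names flags and rates are assumptions:

```python
import pandas as pd

# Hypothetical sample data; the column names follow the question.
df = pd.DataFrame({
    "key1": ["a", "a", "b", "b"],
    "key2": [1, 1, 2, 2],
    "cond2": pd.array([0, 1, 1, 1], dtype="Int64"),
})

# Filling missing values does not change the nullable dtype.
filled_dtype = df["cond2"].fillna(0).dtype
print(filled_dtype)

# Cast the flag columns to float64 first, so the group means
# never need to be converted back into an integer dtype.
flags = ["cond2"]
rates = (
    df.astype({c: "float64" for c in flags})
      .groupby(["key1", "key2"])
      .mean()
)
print(rates)
```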

Answer 3

You can change the dtype of selected columns. For example, exclude the object columns and then cast the rest to the type you need:

# Collect the names of all non-object (here: numeric) columns
lst = list(df.select_dtypes(exclude='object').columns)
df[lst] = df[lst].astype('int64')
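To see what this does, here is a small self-contained sketch on hypothetical data (the values and the name numeric_cols are assumptions). The key column is created with an explicit object dtype to match the df.info() output in the question.

```python
import pandas as pd

# Hypothetical sample data mirroring the dtypes shown in the question.
df = pd.DataFrame({
    "key1": pd.Series(["a", "b"], dtype="object"),
    "cond1": [0, 1],
    "cond2": pd.array([1, 0], dtype="Int64"),
})

# Every column that is not of object dtype.
numeric_cols = list(df.select_dtypes(exclude="object").columns)
print(numeric_cols)

# Cast the selected columns to NumPy int64 in one step.
df[numeric_cols] = df[numeric_cols].astype("int64")
print(df.dtypes)
```

Note that this casts every non-object column, so it is only appropriate when all of them are meant to be integers and none contains missing values.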
