Using pandas pivot_table with Interval columns results in TypeError
    cat1          cat2          col_a  col_b
0   (34.0, 38.0]  (15.9, 47.0]     29     10
1   (34.0, 38.0]  (15.9, 47.0]     37     22
2   (28.0, 34.0]  (47.0, 56.0]      3     13
3   (34.0, 38.0]  (47.0, 56.0]     15      7
4   (28.0, 34.0]  (56.0, 67.0]     42     20
5   (28.0, 34.0]  (47.0, 56.0]     31     23
6   (28.0, 34.0]  (56.0, 67.0]     26     17
7   (28.0, 34.0]  (56.0, 67.0]      7      1
8   (28.0, 34.0]  (56.0, 67.0]     36     19
9   (19.0, 28.0]  (56.0, 67.0]      5      7
10  (19.0, 28.0]  (56.0, 67.0]     21      5
11  (28.0, 34.0]  (67.0, 84.0]     37     13
Given the DataFrame above, I want to perform this pivot table operation with pandas:
pd.pivot_table(df, index='cat1', columns='cat2', values='col_a')
But I get this error:
TypeError: Cannot cast array data from dtype('float64') to dtype('<U32') according to the rule 'safe'
col_a and col_b are both of type int32, while cat1 and cat2 are both categorical. How do I get rid of this error?
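For anyone checking whether their own frame is in the same situation, here is a minimal sketch of how the dtypes could be inspected. The small frame is a hypothetical stand-in built with pd.cut (the question does not say how cat1 and cat2 were created), so treat the construction as an assumption.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the question's frame: pd.cut produces a
# categorical column whose categories are intervals.
df = pd.DataFrame({'col_a': np.array([29, 37, 3], dtype='int32')})
df['cat1'] = pd.cut(pd.Series([36.0, 35.0, 30.0]), bins=[19.0, 28.0, 34.0, 38.0])

# The column dtype is categorical, and its categories form an IntervalIndex.
is_categorical = isinstance(df['cat1'].dtype, pd.CategoricalDtype)
has_interval_categories = isinstance(df['cat1'].cat.categories, pd.IntervalIndex)

print(df.dtypes)
print(is_categorical, has_interval_categories)
```

If both checks are true, the pivot keys are interval-valued categoricals, which is the combination discussed here.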
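This is a bug related to pivoting on intervals (see GH25814) and it is slated to be fixed in v0.25. See also this related question about crosstab: Pandas crosstab on CategoricalDType columns throws TypeError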
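On a pandas version that includes the fix, pivoting directly on the interval categoricals should work without any conversion. The sketch below assumes pandas 0.25 or later; the data is illustrative rather than the question's frame, and observed=True is my own choice to keep only the bins that actually occur.

```python
import numpy as np
import pandas as pd

# Illustrative data (not the question's frame): two interval-valued categoricals.
df = pd.DataFrame({
    'x': [5, 15, 15, 95],
    'y': [12, 12, 88, 88],
    'col_a': [1, 2, 4, 8]})
df['cat1'] = pd.cut(df['x'], bins=np.arange(0, 101, 10))
df['cat2'] = pd.cut(df['y'], bins=np.arange(0, 101, 10))

# observed=True restricts the grouping to bins that appear in the data.
result = pd.pivot_table(
    df, index='cat1', columns='cat2', values='col_a',
    aggfunc='mean', observed=True)
print(result)
```

The row and column labels stay as intervals here, so no relabelling step is needed afterwards.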
In the meantime, here are a few options. To aggregate, you will have to use pivot_table and convert the categorical columns to strings before pivoting.
df2 = df.assign(cat1=df['cat1'].astype(str), cat2=df['cat2'].astype(str))
# to aggregate by taking the mean of col_a
df2.pivot_table(index='cat1', columns='cat2', values='col_a', aggfunc='mean')
The caveat is that you lose the benefit of having the index and columns as intervals.
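One concrete consequence of that caveat is ordering: string labels sort character by character, while intervals sort by their numeric bounds. A small sketch, using made-up intervals chosen only to make the difference visible:

```python
import pandas as pd

# Made-up intervals; the three-digit bound is what exposes the difference.
intervals = pd.IntervalIndex.from_tuples([(100, 110), (20, 30), (0, 10)])

# Interval labels sort by their numeric bounds.
sorted_intervals = intervals.sort_values()

# String labels sort lexicographically, so "(100, 110]" lands before "(20, 30]".
sorted_strings = sorted(str(iv) for iv in intervals)

print(list(sorted_intervals))
print(sorted_strings)
```

With bins like the ones in this question (all two-digit bounds) the two orderings happen to agree, which is why the string workaround still prints a sensibly ordered table.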
Another option is to pivot on the categorical codes and then reassign the categories:
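df2 = df.assign(cat1=df['cat1'].cat.codes, cat2=df['cat2'].cat.codes)
pivot = df2.pivot_table(
    index='cat1', columns='cat2', values='col_a', aggfunc='mean')
# Map each code back to its interval; take() also copes with categories
# that never occur in the data, where a plain assignment would fail
# with a length mismatch.
pivot.index = df['cat1'].cat.categories.take(pivot.index)
pivot.columns = df['cat2'].cat.categories.take(pivot.columns)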
The assignment works because pivot_table sorts its index and columns, and the category codes follow the same order as the interval categories.
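If you need this in more than one place, the codes round trip could be wrapped in a small function. This is only a sketch: pivot_on_codes is a hypothetical helper name, not part of pandas, and dropping rows that fall outside every bin (code -1) is my own assumption about what you would want.

```python
import numpy as np
import pandas as pd

def pivot_on_codes(df, index, columns, values, aggfunc='mean'):
    """Pivot on category codes, then relabel both axes with the intervals.

    Hypothetical helper, not part of pandas.
    """
    coded = df.assign(**{index: df[index].cat.codes,
                         columns: df[columns].cat.codes})
    # A code of -1 marks a missing category; drop those rows before pivoting.
    coded = coded[(coded[index] >= 0) & (coded[columns] >= 0)]
    pivot = coded.pivot_table(
        index=index, columns=columns, values=values, aggfunc=aggfunc)
    # Look up the interval for each code that survived the pivot.
    pivot.index = df[index].cat.categories.take(pivot.index).rename(index)
    pivot.columns = df[columns].cat.categories.take(pivot.columns).rename(columns)
    return pivot

# Illustrative data; the last row falls outside every bin, so cat1 is NaN there.
df = pd.DataFrame({
    'x': [5, 15, 15, 95, 150],
    'y': [12, 12, 88, 88, 50],
    'col_a': [1, 2, 4, 8, 16]})
df['cat1'] = pd.cut(df['x'], bins=np.arange(0, 101, 10))
df['cat2'] = pd.cut(df['y'], bins=np.arange(0, 101, 10))

pivot = pivot_on_codes(df, 'cat1', 'cat2', 'col_a')
print(pivot)
```

Unlike the string conversion, the result keeps an IntervalIndex on both axes, so interval-based lookups and numeric ordering remain available.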
Minimal code sample
import pandas as pd
import numpy as np
np.random.seed(0)
df = pd.DataFrame({
'cat1': np.random.choice(100, 10),
'cat2': np.random.choice(100, 10),
'col_a': np.random.randint(1, 50, 10)})
df['cat1'] = pd.cut(df['cat1'], bins=np.arange(0, 101, 10))
df['cat2'] = pd.cut(df['cat2'], bins=np.arange(0, 101, 10))
df
   cat1      cat2      col_a
0  (40, 50]  (60, 70]     18
1  (40, 50]  (80, 90]     38
2  (60, 70]  (80, 90]     26
3  (60, 70]  (10, 20]     14
4  (60, 70]  (50, 60]      9
5  (0, 10]   (60, 70]     10
6  (80, 90]  (30, 40]     21
7  (20, 30]  (80, 90]     17
8  (30, 40]  (40, 50]      6
9  (80, 90]  (80, 90]     16
(df.assign(cat1=df['cat1'].astype(str), cat2=df['cat2'].astype(str))
.pivot_table(index='cat1', columns='cat2', values='col_a', aggfunc='mean'))
cat2      (10, 20]  (30, 40]  (40, 50]  (50, 60]  (60, 70]  (80, 90]
cat1
(0, 10]        NaN       NaN       NaN       NaN      10.0       NaN
(20, 30]       NaN       NaN       NaN       NaN       NaN      17.0
(30, 40]       NaN       NaN       6.0       NaN       NaN       NaN
(40, 50]       NaN       NaN       NaN       NaN      18.0      38.0
(60, 70]      14.0       NaN       NaN       9.0       NaN      26.0
(80, 90]       NaN      21.0       NaN       NaN       NaN      16.0