Sklearn error: None of [Int64Index([2, 3], dtype='int64')] are in the [columns]

Question

Can someone explain why this code:

from sklearn.model_selection import train_test_split
import pandas as pd
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import SVC
import numpy as np

#df = pd.read_csv('missing_data.csv',sep=',')

df = pd.DataFrame(np.array([[1, 2, 3,4,5,6,7,8,9,1],
                            [4, 5, 6,3,4,5,7,5,4,1],
                            [7, 8, 9,6,2,3,6,5,4,1],
                            [7, 8, 9,6,1,3,2,2,4,0],
                            [7, 8, 9,6,5,6,6,5,4,0]]),
                            columns=['a', 'b', 'c','d','e','f','g','h','i','j'])

X_train = df.iloc[:,:-1]
y_train = df.iloc[:,-1]


clf=SVC(kernel='linear')
kfold = StratifiedKFold(n_splits=2,random_state=42,shuffle=True)
for train_index,test_index in kfold.split(X_train,y_train):
    x_train_fold,x_test_fold = X_train[train_index],X_train[test_index]
    y_train_fold,y_test_fold = y_train[train_index],y_train[test_index]
    clf.fit(x_train_fold,y_train_fold)

raises this error:

Traceback (most recent call last):
  File "test_traintest.py", line 23, in <module>
    x_train_fold,x_test_fold = X_train[train_index],X_train[test_index]
  File "/Users/slowat/anaconda/envs/nlp_course/lib/python3.7/site-packages/pandas/core/frame.py", line 3030, in __getitem__
    indexer = self.loc._get_listlike_indexer(key, axis=1, raise_missing=True)[1]
  File "/Users/slowat/anaconda/envs/nlp_course/lib/python3.7/site-packages/pandas/core/indexing.py", line 1266, in _get_listlike_indexer
    self._validate_read_indexer(keyarr, indexer, axis, raise_missing=raise_missing)
  File "/Users/slowat/anaconda/envs/nlp_course/lib/python3.7/site-packages/pandas/core/indexing.py", line 1308, in _validate_read_indexer
    raise KeyError(f"None of [{key}] are in the [{axis_name}]")
KeyError: "None of [Int64Index([2, 3], dtype='int64')] are in the [columns]"

I saw this answer, but my columns are all the same length.

Answer 1

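StratifiedKFold.split(), like KFold.split(), returns arrays of integer row positions for the training and test sets. They should be applied to a DataFrame like this: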

X_train.iloc[train_index]

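With your syntax, X_train[train_index], pandas interprets the array as a list of column names, which is why the KeyError mentions [columns]. Change your code to: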

from sklearn.model_selection import train_test_split
import pandas as pd
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import SVC
import numpy as np

#df = pd.read_csv('missing_data.csv',sep=',')

df = pd.DataFrame(np.array([[1, 2, 3,4,5,6,7,8,9,1],
                            [4, 5, 6,3,4,5,7,5,4,1],
                            [7, 8, 9,6,2,3,6,5,4,1],
                            [7, 8, 9,6,1,3,2,2,4,0],
                            [7, 8, 9,6,5,6,6,5,4,0]]),
                            columns=['a', 'b', 'c','d','e','f','g','h','i','j'])

X_train = df.iloc[:,:-1]
y_train = df.iloc[:,-1]


clf=SVC(kernel='linear')
kfold = StratifiedKFold(n_splits=2,random_state=42,shuffle=True)
for train_index,test_index in kfold.split(X_train,y_train):
    x_train_fold,x_test_fold = X_train.iloc[train_index],X_train.iloc[test_index]
    y_train_fold,y_test_fold = y_train.iloc[train_index],y_train.iloc[test_index]
    clf.fit(x_train_fold,y_train_fold)

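Note that we use .iloc rather than .loc. This is because .iloc selects by integer position, which is what split() returns, whereas .loc selects by index label. In your case it makes no difference, because the pandas index happens to match the integer positions, but that may not hold in other projects, so stick with .iloc.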
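To make the difference concrete, here is a minimal sketch using a hypothetical DataFrame (not the one from the question) whose index labels do not start at 0, as can happen after filtering rows. The names `positions`, `by_position`, `bracket_error` and `loc_error` are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical frame whose index labels are not 0..n-1,
# e.g. what is left after filtering rows out of a larger frame.
df = pd.DataFrame(
    {"a": [10, 20, 30, 40], "b": [1, 2, 3, 4]},
    index=[100, 101, 102, 103],
)

# The kind of integer position array that split() yields.
positions = np.array([2, 3])

# .iloc selects by position: the third and fourth rows.
by_position = df.iloc[positions]
print(by_position["a"].tolist())

# Plain brackets treat the array as column labels, so this fails
# with the same kind of KeyError as in the question.
try:
    df[positions]
    bracket_error = None
except KeyError as exc:
    bracket_error = exc

# .loc treats the array as index labels; labels 2 and 3 do not exist here.
try:
    df.loc[positions]
    loc_error = None
except KeyError as exc:
    loc_error = exc

print(type(bracket_error).__name__, type(loc_error).__name__)
```

Only the .iloc lookup succeeds here. With a default 0..n-1 index, .loc would appear to work as well, which is how the problem can stay hidden until the index changes.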

Alternatively, you can convert X_train and y_train to numpy arrays when you extract them:

X_train = df.iloc[:,:-1].to_numpy()
y_train = df.iloc[:,-1].to_numpy()

Your original code will then work unchanged, because indexing a numpy array with an integer array selects rows by position.
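For completeness, a sketch of the numpy variant as a full script, reusing the data from the question. The variable names `X`, `y` and `fold_scores`, and the call to `clf.score` on the held-out fold, are additions for illustration and are not part of the original code.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import SVC

df = pd.DataFrame(
    np.array([[1, 2, 3, 4, 5, 6, 7, 8, 9, 1],
              [4, 5, 6, 3, 4, 5, 7, 5, 4, 1],
              [7, 8, 9, 6, 2, 3, 6, 5, 4, 1],
              [7, 8, 9, 6, 1, 3, 2, 2, 4, 0],
              [7, 8, 9, 6, 5, 6, 6, 5, 4, 0]]),
    columns=list("abcdefghij"),
)

# Convert to plain arrays so that integer-array indexing selects rows.
X = df.iloc[:, :-1].to_numpy()
y = df.iloc[:, -1].to_numpy()

clf = SVC(kernel="linear")
kfold = StratifiedKFold(n_splits=2, shuffle=True, random_state=42)

fold_scores = []  # accuracy on each held-out fold (illustrative addition)
for train_index, test_index in kfold.split(X, y):
    clf.fit(X[train_index], y[train_index])
    fold_scores.append(clf.score(X[test_index], y[test_index]))

print(len(fold_scores))
```

The trade-off is that the arrays no longer carry column names or index labels, so keep the DataFrame and .iloc if you need those later.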
