Python Mlens Ensemble: KeyError: "None of [Int64Index([... dtype='int64', length=105)] are in the [columns]"

Question

Following is a small version of the code that produces this error: KeyError: "None of [Int64Index([...], dtype='int64')] are in the [columns]"

'...' is a series of numbers that seem to match the index of my X and y dataframes.

I am using the Mlens package to model with SuperLearner on a very large dataset (so scalability is important). My goal is to use a dataframe structure rather than a NumPy array, which will solve downstream issues.

So far, I've explored this and other related posts, but the solutions do not seem to apply here.

The dataset is the Iris dataset, available as a .csv here: <https://datahub.io/machine-learning/iris#data/>

Note that a custom Random Forest function works well, but the mlens SuperLearner raises the error.

from sklearn.base import BaseEstimator
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from mlens.ensemble.super_learner import SuperLearner
import numpy as np
import pandas as pd

df = pd.read_csv("/home/marktest/iris_csv.csv")
type(df)
N_FOLDS = 5
RF_ESTIMATORS = 100
RANDOM_STATE = 42
class RFBasedFeatureSelector(BaseEstimator):
  
    def __init__(self, n_estimators):
        self.n_estimators = n_estimators
        self.selector = None

    def fit(self, X, y):
        clf = RandomForestClassifier(n_estimators=self.n_estimators, random_state = RANDOM_STATE, class_weight = 'balanced')
        clf = clf.fit(X, y)
        self.selector = SelectFromModel(clf, prefit=True, threshold = 0.01)
        return self  # scikit-learn convention: fit returns the estimator

    def transform(self, X):
        if self.selector is None:
            raise AttributeError('The selector attribute has not been assigned. You cannot call transform before first calling fit or fit_transform.')
        return self.selector.transform(X)

    def fit_transform(self, X, y):
        self.fit(X, y)
        return self.transform(X)
df.head()
X = df.iloc[:,0:3]                                    # features: first three columns only (0:4 would take all four Iris features)
y = df.iloc[:,4]                                     # outcome: the class column, as a Series

X, X_val, y, y_val = train_test_split(X, y, test_size=.3, random_state=RANDOM_STATE, stratify=y)
from mlens.metrics import make_scorer
from sklearn.metrics import roc_auc_score, balanced_accuracy_score
accuracy_scorer = make_scorer(roc_auc_score, average='micro', greater_is_better=True)

clf = RandomForestClassifier(RF_ESTIMATORS, random_state=RANDOM_STATE,class_weight='balanced')
scaler = StandardScaler()
feature_selector = RFBasedFeatureSelector(RF_ESTIMATORS)
clf.fit(feature_selector.fit_transform(scaler.fit_transform(X), y), y)
accuracy_score(y_val, clf.predict(feature_selector.transform(scaler.transform(X_val))))

ensemble = SuperLearner(folds=N_FOLDS, shuffle=True, random_state=RANDOM_STATE, scorer=balanced_accuracy_score, backend="threading")

preprocessing = {'pipeline-1': [StandardScaler(), RFBasedFeatureSelector(RF_ESTIMATORS)]}

estimators = {'pipeline-1': [RandomForestClassifier(RF_ESTIMATORS, random_state=RANDOM_STATE, class_weight='balanced')]}

ensemble.add(estimators, preprocessing)

ensemble.add_meta(LogisticRegression(solver='liblinear', class_weight = 'balanced'))
ensemble.fit(X,y)

Answer 1

I think the problem is in shuffle=True. I had a similar problem, and after setting shuffle=False the error message no longer appeared.
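
A possible explanation, which is an assumption and has not been checked against the mlens source: a shuffling step that reorders rows with plain bracket indexing (X[idx]) works on a NumPy array but not on a DataFrame, because pandas treats an integer array inside brackets as a list of column labels. That would match a KeyError listing 105 integers, the number of training rows after a 70/30 split of 150 samples. The sketch below reproduces the pandas behavior on its own, without mlens; the toy frame and variable names are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame with string column names and a non-contiguous row index,
# similar to what train_test_split returns for a DataFrame.
X = pd.DataFrame(
    {"a": [1.0, 2.0, 3.0, 4.0], "b": [5.0, 6.0, 7.0, 8.0]},
    index=[10, 3, 7, 1],
)

# A row permutation like the one a shuffling step might generate.
idx = np.random.RandomState(42).permutation(len(X))

# Bracket indexing on a DataFrame looks up COLUMN labels, so this fails.
try:
    X[idx]
    raised = False
except KeyError as exc:
    raised = True
    print("KeyError:", exc)

# Positional row selection works on the frame itself...
shuffled_frame = X.iloc[idx]

# ...and plain bracket indexing works once the data is a NumPy array.
shuffled_array = X.to_numpy()[idx]
```

If shuffling is needed, one option to try is passing X.to_numpy() and y.to_numpy() to ensemble.fit while keeping X.columns and X.index around to rebuild dataframes downstream. Whether this is enough for a given mlens version is untested here.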

0 2021-10-20 17:03:28
