
Python Mlens Ensemble: KeyError: "None of [Int64Index([... dtype='int64', length=105)] are in the [columns]"

Below is a reduced version of the code where I'm getting this error: KeyError: "None of [Int64Index([...], dtype='int64')] are in the [columns]"

The '...' stands for a series of numbers that appears to match the index of my X and y dataframes.

I am using the mlens package to model with SuperLearner on a very large dataset (so scalability is important). My goal is to work with a DataFrame structure rather than a NumPy array, which will solve downstream issues.

So far, I've explored this and other related posts, but the solutions do not seem to apply here.

The dataset is the Iris dataset, available as a .csv here: https://datahub.io/machine-learning/iris#data/

Note that a custom Random Forest workflow works fine; it is the mlens SuperLearner that errors.

from sklearn.base import BaseEstimator               # needed for RFBasedFeatureSelector below
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from mlens.ensemble.super_learner import SuperLearner
import numpy as np
import pandas as pd

df = pd.read_csv("/home/marktest/iris_csv.csv")
type(df)
N_FOLDS = 5
RF_ESTIMATORS = 100
RANDOM_STATE = 42
class RFBasedFeatureSelector(BaseEstimator):
  
    def __init__(self, n_estimators):
        self.n_estimators = n_estimators
        self.selector = None

    def fit(self, X, y):
        clf = RandomForestClassifier(n_estimators=self.n_estimators, random_state=RANDOM_STATE, class_weight='balanced')
        clf = clf.fit(X, y)
        self.selector = SelectFromModel(clf, prefit=True, threshold=0.01)
        return self                                   # follow the sklearn convention of fit returning self

    def transform(self, X):
        if self.selector is None:
            raise AttributeError('The selector attribute has not been assigned. You cannot call transform before first calling fit or fit_transform.')
        return self.selector.transform(X)

    def fit_transform(self, X, y):
        self.fit(X, y)
        return self.transform(X)
df.head()
X = df.iloc[:, 0:4]                                  # split off the four feature columns into a new dataframe
y = df.iloc[:, 4]                                    # split off the outcome column into a new dataframe

X, X_val, y, y_val = train_test_split(X, y, test_size=.3, random_state=RANDOM_STATE, stratify=y)
from mlens.metrics import make_scorer
from sklearn.metrics import roc_auc_score, balanced_accuracy_score
accuracy_scorer = make_scorer(roc_auc_score, average='micro', greater_is_better=True)

clf = RandomForestClassifier(RF_ESTIMATORS, random_state=RANDOM_STATE,class_weight='balanced')
scaler = StandardScaler()
feature_selector = RFBasedFeatureSelector(RF_ESTIMATORS)
clf.fit(feature_selector.fit_transform(scaler.fit_transform(X), y), y)
accuracy_score(y_val, clf.predict(feature_selector.transform(scaler.transform(X_val))))

ensemble = SuperLearner(folds=N_FOLDS, shuffle=True, random_state=RANDOM_STATE, scorer=balanced_accuracy_score, backend="threading")

preprocessing = {'pipeline-1': [StandardScaler(), RFBasedFeatureSelector(RF_ESTIMATORS)]}

estimators = {'pipeline-1': [RandomForestClassifier(RF_ESTIMATORS, random_state=RANDOM_STATE, class_weight='balanced')]}

ensemble.add(estimators, preprocessing)

ensemble.add_meta(LogisticRegression(solver='liblinear', class_weight = 'balanced'))
ensemble.fit(X, y)

I think the problem is shuffle=True. I had a similar problem, and after setting shuffle=False the error message no longer appeared.
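
A minimal sketch of that change, reusing the names already defined in the question's code (N_FOLDS, RANDOM_STATE, estimators, preprocessing, X, y) and only flipping the shuffle argument to SuperLearner:

ensemble = SuperLearner(folds=N_FOLDS,
                        shuffle=False,               # changed from shuffle=True
                        random_state=RANDOM_STATE,
                        scorer=balanced_accuracy_score,
                        backend="threading")
ensemble.add(estimators, preprocessing)
ensemble.add_meta(LogisticRegression(solver='liblinear', class_weight='balanced'))
ensemble.fit(X, y)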
