
Python Mlens Ensemble: KeyError: "None of [Int64Index([... dtype='int64', length=105)] are in the [columns]"

Below is a small version of the code that produces this error: KeyError: "None of [Int64Index([...], dtype='int64')] are in the [columns]"

The '...' is a series of numbers that seem to match the index of my X and y dataframes.

I am using the Mlens package to model with SuperLearner on a very large dataset (so scalability is important). My goal is to use a dataframe structure rather than a NumPy array, which will solve downstream issues.

So far, I've explored this and other related posts, but the solutions do not seem to apply here.

The dataset is the Iris dataset, available as a .csv file here: https://datahub.io/machine-learning/iris#data/

Note that the custom random forest pipeline works well on its own, but the mlens SuperLearner raises the error.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator  # needed by RFBasedFeatureSelector below
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score, balanced_accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from mlens.ensemble.super_learner import SuperLearner
from mlens.metrics import make_scorer

df = pd.read_csv("/home/marktest/iris_csv.csv")
type(df)

N_FOLDS = 5
RF_ESTIMATORS = 100
RANDOM_STATE = 42


class RFBasedFeatureSelector(BaseEstimator):
    """Select features whose random forest importance exceeds a threshold."""

    def __init__(self, n_estimators):
        self.n_estimators = n_estimators
        self.selector = None

    def fit(self, X, y):
        clf = RandomForestClassifier(n_estimators=self.n_estimators,
                                     random_state=RANDOM_STATE,
                                     class_weight='balanced')
        clf = clf.fit(X, y)
        self.selector = SelectFromModel(clf, prefit=True, threshold=0.01)
        return self  # scikit-learn convention: fit returns the estimator

    def transform(self, X):
        if self.selector is None:
            raise AttributeError('The selector attribute has not been assigned. '
                                 'You cannot call transform before first calling fit or fit_transform.')
        return self.selector.transform(X)

    def fit_transform(self, X, y):
        self.fit(X, y)
        return self.transform(X)


df.head()
X = df.iloc[:, 0:3]  # features: first three columns (0:4 would include all four)
y = df.iloc[:, 4]    # outcome: the class column, as a Series

X, X_val, y, y_val = train_test_split(X, y, test_size=.3, random_state=RANDOM_STATE, stratify=y)

accuracy_scorer = make_scorer(roc_auc_score, average='micro', greater_is_better=True)

# The hand-built pipeline: this part runs without errors.
clf = RandomForestClassifier(RF_ESTIMATORS, random_state=RANDOM_STATE, class_weight='balanced')
scaler = StandardScaler()
feature_selector = RFBasedFeatureSelector(RF_ESTIMATORS)
clf.fit(feature_selector.fit_transform(scaler.fit_transform(X), y), y)
accuracy_score(y_val, clf.predict(feature_selector.transform(scaler.transform(X_val))))

# The mlens ensemble: fitting it on the DataFrame raises the KeyError.
ensemble = SuperLearner(folds=N_FOLDS, shuffle=True, random_state=RANDOM_STATE,
                        scorer=balanced_accuracy_score, backend="threading")

preprocessing = {'pipeline-1': [StandardScaler(), RFBasedFeatureSelector(RF_ESTIMATORS)]}

estimators = {'pipeline-1': [RandomForestClassifier(RF_ESTIMATORS, random_state=RANDOM_STATE,
                                                    class_weight='balanced')]}

ensemble.add(estimators, preprocessing)

ensemble.add_meta(LogisticRegression(solver='liblinear', class_weight='balanced'))
ensemble.fit(X, y)
```

I think the problem is in shuffle=True. I had a similar problem, and after setting shuffle=False the error message no longer appeared.
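One plausible explanation for why shuffle matters, assuming the library shuffles by indexing X with an array of row positions: on a pandas DataFrame, square-bracket indexing with an integer array looks up column labels rather than rows, which produces exactly this kind of KeyError. The sketch below uses only pandas and NumPy with a toy frame (the column names and values are illustrative, not the real dataset) to reproduce the message and show two possible workarounds.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Iris features; names and values are illustrative only.
X = pd.DataFrame({"sepallength": [5.1, 4.9, 6.3, 5.8],
                  "sepalwidth": [3.5, 3.0, 3.3, 2.7]})
y = pd.Series(["setosa", "setosa", "virginica", "virginica"])

rng = np.random.RandomState(42)
idx = rng.permutation(X.shape[0])  # shuffled row positions

# DataFrame[...] with an integer array is a COLUMN lookup, so none of the
# integers match the string column names and pandas raises KeyError.
try:
    X[idx]
    raised = False
    message = ""
except KeyError as exc:
    raised = True
    message = str(exc)

# Workaround 1: shuffle by position yourself and keep the DataFrame,
# then fit with shuffle=False.
X_shuffled = X.iloc[idx]
y_shuffled = y.iloc[idx]

# Workaround 2: pass plain NumPy arrays, where X[idx] selects rows.
X_arr = X.to_numpy()
y_arr = y.to_numpy()
X_arr_shuffled = X_arr[idx]
```

With the first workaround the data stays in a DataFrame, which matches the goal stated in the question; whether the rest of the ensemble handles DataFrames correctly after shuffle=False has not been verified here, so converting with to_numpy() may still be the safer route.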
