KeyError：“[索引（[（'A'，'B'，'C'）]，dtype ='object'）]都不在[列]中

Question

我將我的 X 和 y 定義如下：

X=df[text_columns + categorical_columns + textual_columns + numeric_columns]
y=df[['Label']]

在哪里

text_columns='推文'
categorical_columns=['A','B','C']
numeric_columns =['N1','N2']

列名只是舉例。 然后我分成訓練/測試：

X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=(1/5), random_state=38, stratify=y)

我正在嘗試構建一個定制的變壓器，如下所示：

分類的

from sklearn.base import BaseEstimator
from sklearn.base import TransformerMixin

class CategoricalTransformer(BaseEstimator, TransformerMixin):

    def __init__(self):
        super().__init__()

    # Return self nothing else to do here
    def fit(self, X, y=None):
        return self

    # Helper function that converts values to Binary depending on input
    def create_binary(self, obj):
        if obj == 0:
            return 'No'
        else:
            return 'Yes'

    # Transformer method for this transformer
    def transform(self, X, y=None):
        # Categorical features to pass down the categorical pipeline
        return X[[categorical_columns]].values
    
    def get_feature_names(self):
        return X.columns.tolist()
    
# Defining the steps in the categorical pipeline
categorical_pipeline = Pipeline(steps=[
    ('categorical_transformer', CategoricalTransformer()),
    ('one_hot_encoder', OneHotEncoder(handle_unknown='ignore'))])

文本class TextTransformer(BaseEstimator, TransformerMixin): def init (self): super()。 初始化（）

    # Return self nothing else to do here
    def fit(self, X, y=None):
        return self

    # Helper function that converts values to Binary depending on input
    def create_binary(self, obj):
        if obj == 0:
            return 'No'
        else:
            return 'Yes'

    # Transformer method for this transformer
    def transform(self, X, y=None):
        # Text features to pass down the text pipeline
        return X[['Tweet']].values
    
    def get_feature_names(self):
        return X.columns.tolist()
    
# Defining the steps in the text pipeline
text_pipeline = Pipeline(steps=[
    ('text_transformer', TextTransformer()),
    ('cv', CountVectorizer())])

數字

class NumericalTransformer(BaseEstimator, TransformerMixin):
    def __init__(self):
        super().__init__()

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        # Numerical features to pass down the numerical pipeline
        X = X[[numeric_columns]]
        X = X.replace([np.inf, -np.inf], np.nan)
        return X.values
    
    def get_feature_names(self):
        return X.columns.tolist()
    
# Defining the steps in the numerical pipeline
numerical_pipeline = Pipeline(steps=[
    ('numerical_transformer', NumericalTransformer()),
    ('imputer', KNNImputer(n_neighbors=2)),
    ('minmax', MinMaxScaler())])

然后我使用特征聯合：

from sklearn.pipeline import FeatureUnion

union_pipeline = FeatureUnion(transformer_list=[
    ('categorical_pipeline', categorical_pipeline),
    ('numerical_pipeline', numerical_pipeline), 
    ('text_pipeline', text_pipeline)])

# Combining the custom imputer with the categorical, text and numerical pipeline
preprocess_pipeline = Pipeline(steps=[('full_pipeline', union_pipeline)])

但是當我運行 model

# MODEL
from sklearn import tree

# Decision Tree
decision_tree = tree.DecisionTreeClassifier()
full_pipeline = Pipeline(steps=[
    ('preprocess_pipeline', preprocess_pipeline),
    ('model', decision_tree)])

# fit on the complete pipeline
training = full_pipeline.fit(X, y)
print(full_pipeline.get_params())

# metrics
score_test = \
    round(training.score(X, y) * 100, 2)
print(f"\nTraining Accuracy: {score_test}")

我收到此錯誤：

---> 12 training = full_pipeline.fit(X, y)
<ipython-input-69-051568c7b272> in transform(self, X, y)
     21     def transform(self, X, y=None):
     22         # Categorical features to pass down the categorical pipeline
---> 23         return X[[('A','B','C')]].values
     24 
     25     def get_feature_names(self):
....

KeyError: "None of [Index([('A','B','C')], dtype='object')] are in the [columns]"

數字列也出現類似的錯誤。 TextTransformer 似乎是唯一一個沒有錯誤的工作。

我想我正在考慮的數據集/列有問題。

Answer 1

如果numeric_columns （和其他任何一個）是元組，那么你做

X[numeric_columns]

代替

X[[numeric_columns]]

到 select 來自 pandas DataFrame 的列子集

KeyError：“[索引（[（'A'，'B'，'C'）]，dtype ='object'）]都不在[列]中

問題描述

1 個解決方案

解決方案1
1 已采納 2021-05-17 13:10:28

KeyError：“[索引（[（'A'，'B'，'C'）]，dtype ='object'）]都不在[列]中

問題描述

1 個解決方案

解決方案1 1 已采納 2021-05-17 13:10:28

解決方案1
1 已采納 2021-05-17 13:10:28