
Pre-processing, resampling and pipelines - and an error in between

I have a dataset with different types of variables: binary, categorical, numerical, and textual.

   Text                                                Age   Type          Link             Start    Passed  Default
0  care packag saint luke cathol church wa ...        21.0  organisation  saintlukemclean  <2001.0  0       0
1  opportun busi group center food support compan...  23.0  organisation  cfanj            <2003.0  0       0
2  holiday ice rink persh squar depart cultur sit...  98.0  home          culturela        >1975.0  0       0

I have used different transformers, one for categorical variables (OneHotEncoder), one for numerical variables (SimpleImputer), and one for text variables (CountVectorizer/TF-IDF):

categorical_preprocessing = OneHotEncoder(handle_unknown='ignore')
# categorical_encoder =  ('CV',CountVectorizer())

numeric_preprocessing = Pipeline([
    ('imputer', SimpleImputer(strategy='mean'))
])

# CountVectorizer
text_preprocessing_cv =  Pipeline(steps=[
    ('CV',CountVectorizer())
]) 

# TF-IDF
text_preprocessing_tfidf = Pipeline(steps=[
    ('TF-IDF',TfidfVectorizer())       
])

to transform my features, passing them into pipelines (with Logistic Regression, Multinomial Naive Bayes, Random Forest, and SVM classifiers) as follows:

preprocessing = ColumnTransformer(
    transformers=[
        ('text', text_preprocessing_cv, text_columns),
        ('category', categorical_preprocessing, categorical_columns),
        ('numeric', numeric_preprocessing, numerical_columns)
])

However, I have got an error at this step:

from sklearn.linear_model import LogisticRegression

clf = Pipeline(steps=[('preprocessor', preprocessing),
                      ('classifier', LogisticRegression())])

clf.fit(X_train, y_train) # <-- error

ValueError: Selected columns, ['Age','Default'] are not unique in dataframe.

This error might be caused by my oversampling or by the way I pre-process the features... The correct order should be to apply resampling only to the training set, to avoid data leakage, but it is not clear to me whether I need to handle the different variable types and their transformers before or after resampling.
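For the ordering question, a minimal sketch of "split first, then oversample only the training fold" is below. The tiny DataFrame and column names are made up for illustration; it uses `sklearn.utils.resample` as in the code further down:

```python
# Sketch: split BEFORE resampling, then oversample only the training fold,
# so no test row is ever duplicated back into the training data.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Synthetic stand-in for the real dataset (4 positives, 8 negatives)
df = pd.DataFrame({
    'Age':    [21.0, 23.0, 98.0, 45.0, 33.0, 60.0, 27.0, 52.0, 40.0, 36.0, 70.0, 29.0],
    'Passed': [1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0],
})
X = df[['Age']]
y = df['Passed']

# 1) Split first
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42)

# 2) Oversample the minority class inside the training fold only
train = pd.concat([X_train, y_train], axis=1)
minority = train[train['Passed'] == 1]
majority = train[train['Passed'] == 0]
minority_up = resample(minority, replace=True,
                       n_samples=len(majority), random_state=42)
balanced = pd.concat([majority, minority_up])

# 3) Fit transformers/classifier on the balanced fold; test fold is untouched
X_train_bal = balanced.drop(columns='Passed')
y_train_bal = balanced['Passed']
```

The transformers themselves (OneHotEncoder, SimpleImputer, CountVectorizer) are fitted after this step, on `X_train_bal` only, via the pipeline's `fit`.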

I would appreciate your help in fixing the error and getting a pipeline working with this preprocessing. Thanks.

Please refer to the code:

text_columns = ['Text']
categorical_columns = ['Type', 'Link', 'Start']
numerical_columns = ['Age', 'Default'] # can I consider the boolean as numerical?

X = df[categorical_columns + numerical_columns + text_columns]
y = df['Passed']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42)

# Returning to one dataframe
training_set = pd.concat([X_train, y_train], axis=1) # needed for the re-sampling technique

passed = training_set[training_set['Passed'] == 1]
not_passed = training_set[training_set['Passed'] == 0]

# Oversampling the minority class
oversample = resample(passed,
                      replace=True,
                      n_samples=len(not_passed))

# Returning to a new training set
oversample_train = pd.concat([not_passed, oversample])

train_df = oversample_train.copy() # this train set is after applying the re-sampling
test_df = pd.concat([X_test, y_test], axis=1)

X_train = train_df.loc[:, train_df.columns != 'Passed']
y_train = train_df['Passed']

categorical_encoder = OneHotEncoder(handle_unknown='ignore')
numerical_pipe = Pipeline([
    ('imputer', SimpleImputer(strategy='mean'))
])
text_transformer_cv = Pipeline(steps=[
    ('cntvec', CountVectorizer())
])

# TF-IDF
text_preprocessing_tfidf = Pipeline(steps=[
    ('TF-IDF', TfidfVectorizer())
])

preprocessing = ColumnTransformer(
    transformers=[
        ('category', categorical_encoder, categorical_columns),
        ('numeric', numerical_pipe, numerical_columns), # I think this is causing the error. But I do not know why it is not also the categorical columns
        ('text', text_transformer_cv, text_columns)
])

clf = Pipeline(steps=[('preprocessor', preprocessing),
                      ('classifier', LogisticRegression())])

clf.fit(X_train, y_train)
   

The issue is the way a single text column is passed. When `ColumnTransformer` is given a list of column names it passes a 2-D frame to the transformer, but `CountVectorizer` expects a 1-D sequence of strings, which it only receives when the column is given as a plain string. I hope a future version of scikit-learn will allow `['Text']`, but until then pass it directly:

...

text_columns = 'Text' # instead of ['Text']

preprocessing = ColumnTransformer(
    transformers=[
        ('text', text_preprocessing_cv, text_columns),
        ('category', categorical_preprocessing,
            categorical_columns), 
        ('numeric', numeric_preprocessing, numerical_columns)
    ],
    remainder='passthrough'
)
