
Combining sklearn pipeline and cross validation with binary columns

I want to run a regression model on a dataset with one text column, five binary variables, and a numeric target variable. I included a CountVectorizer to vectorize the text column and tried to combine everything into a sklearn Pipeline using make_column_transformer. The data has no missing values. However, when running the script below I get the following warning:

FitFailedWarning: Estimator fit failed. The score on this train-test 
partition for these parameters will be set to nan.

and the following error message:

TypeError: All estimators should implement fit and transform, or can be 
'drop' or 'passthrough' specifiers. 'Level1' (type <class 'str'>) doesn't.

I think the problem may be that I did not specify a second tuple in make_column_transformer but only passed sample_df[categorical_cols]. However, I am not sure how to include the already-processed, ready-to-use columns in make_column_transformer.

Full code:

from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import KFold
from sklearn.compose import make_column_transformer
from sklearn.model_selection import cross_val_score


categorical_cols = [col for col in sample_df.columns if col.startswith('Level')]
textual_col = ['Text']

pipeline = Pipeline([
    ('transformer', make_column_transformer((CountVectorizer(), textual_col), 
                                             sample_df[categorical_cols],
                                           remainder='passthrough')),
    ('model', RandomForestRegressor())
])

X = sample_df[textual_col + categorical_cols]
y = sample_df['Value']

cv = KFold(n_splits=5, shuffle=True)
scores = cross_val_score(pipeline, X, y, cv=cv)
scores

Sample dataset:

import io
import pandas as pd

data_string = """
Level1;Level2;Level3;Level4;Level5;Text;Value
0;0;1;0;0;Are you sure that the input;109.3
0;0;0;0;0;that the input text data for;87.2
0;0;1;0;0;text data for your model is;21.5
0;0;0;0;0;your model is in English? Well,;143.5
0;0;0;0;1;in English? Well, no one can;141.1
0;0;0;0;0;no one can be sure about;93.4
0;0;0;0;0;be sure about this, as no;29.5
0;0;0;0;0;this, as no one will read;17.9
0;0;1;0;0;one will read around 20k records;37.8
0;0;1;0;0;around 20k records of text data.;153.7
0;0;0;0;0;of text data. So, how non-English;99.5
0;0;0;1;0;So, how non-English text will affect;119.1
0;0;0;0;1;text will affect your English text;97.5
0;0;0;0;0;your English text trained model? Pick;49.2
0;0;0;0;0;trained model? Pick any non-English text;79.3
0;0;0;0;0;any non-English text and pass it;107.7
0;1;0;0;1;and pass it through as input;117.3
0;0;0;0;0;through as input to your English;151.1
0;0;0;0;0;to your English text trained classification;47.3
0;0;0;0;0;text trained classification model. You will;129.3
0;0;0;0;0;model. You will come to know;135.1
0;0;0;0;0;come to know that the category;145.8
0;0;0;0;1;that the category is assigned to;131.9
1;0;0;1;0;is assigned to non-English text by;43.7
1;0;0;0;0;non-English text by the model. If;67.1
1;0;0;0;0;the model. If your model is;105.3
0;0;0;1;0;your model is dependent on one;65.2
0;1;0;0;0;dependent on one language then, other;98.3
0;0;0;0;0;language then, other languages in your;130.5
0;0;0;0;0;languages in your textual data should;107.2
0;1;1;0;0;textual data should be considered as;66.5
0;0;0;1;0;be considered as noise. But why?;43.1
0;0;0;0;1;noise. But why? The job of;56.7
0;0;0;0;0;The job of the text classification;75.1
1;0;0;0;0;the text classification model is to;88.3
1;0;0;0;0;model is to classify. And, it;91.3
0;0;0;0;0;classify. And, it will do its;106.4
1;0;0;0;0;will do its job despite its;109.5
0;0;0;0;1;job despite its input text will;143.1
0;0;0;0;0;input text will be in English;54.1
1;0;0;0;0;be in English or not. What;96.4
0;0;0;1;0;or not. What can we do;133.8
0;0;0;0;0;can we do to avoid such;146.4
0;0;1;0;0;to avoid such a situation? Your;164.3
0;0;1;0;0;a situation? Your model will not;34.6
0;0;0;0;0;model will not stop classifying the;76.8
0;0;0;1;0;stop classifying the non-English text. So,;80.5
0;0;1;0;0;non-English text. So, you have to;90.3
0;0;0;0;0;you have to detect the non-English;68.3
0;0;0;0;0;detect the non-English text and remove;44.0
0;0;1;0;0;text and remove it from trained;100.4
0;0;0;0;0;it from trained data and prediction;117.4
0;0;0;0;1;data and prediction data. This process;85.4
0;1;0;0;0;data. This process comes under the;65.7
0;0;1;0;0;comes under the data cleaning part.;54.3
0;1;0;0;0;data cleaning part. Inconsistency in your;78.9
0;0;0;0;0;Inconsistency in your data will result;96.8
1;0;0;0;1;data will result in a decrease;108.1
0;0;0;0;0;in a decrease in the accuracy;145.7
1;0;0;0;0;in the accuracy of the model.;103.6
0;0;1;0;0;of the model. Sometimes, multiple languages;56.4
0;0;0;0;1;Sometimes, multiple languages present in text;90.5
0;0;0;0;0;present in text data could be;80.4
0;0;0;0;0;data could be one of the;90.7
1;0;0;0;0;one of the reasons your model;48.8
0;0;0;0;0;reasons your model behaves strangely. So,;65.4
0;0;1;0;0;behaves strangely. So, in this article,;107.5
0;0;0;0;0;in this article, we will discuss;143.2
0;0;0;0;0;we will discuss the different python;165.0
0;0;0;0;0;the different python libraries which detect;123.3
0;0;0;0;1;libraries which detect the language(s) of;85.3
0;0;0;0;0;the language(s) of the text data.;91.4
0;0;0;0;1;the text data. Let’s start with;49.5
0;0;0;0;0;Let’s start with the spaCy library.;76.3
0;0;0;0;0;the spaCy library.;49.5
"""

sample_df = pd.read_csv(io.StringIO(data_string), sep=';')

You can use remainder='passthrough' to avoid transforming columns that are already processed (so, in your case, you can treat the binary columns as remaining columns that the ColumnTransformer object will not process but will pass through). Then you should be aware that CountVectorizer requires a 1D array as input, so the column passed to make_column_transformer should be specified as a string ('Text') rather than an array (['Text']); see the make_column_transformer() documentation:

columns: str, array-like of str, int, array-like of int, slice, array-like of bool or callable

Indexes the data on its second axis. Integers are interpreted as positional columns, while strings can reference DataFrame columns by name. A scalar string or int should be used where the transformer expects X to be a 1d array-like (vector), otherwise a 2d array will be passed to the transformer. A callable is passed the input data X and can return any of the above. To select multiple columns by name or dtype, you can use make_column_selector.
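To make the distinction concrete, compare what each kind of specifier selects from the sample_df defined above (a small illustration, not part of the original code):

# a scalar string specifier selects a 1D pandas Series, which is what CountVectorizer expects
sample_df['Text']
# an array-like specifier selects a 2D DataFrame with one column, which CountVectorizer
# does not treat as a collection of documents
sample_df[['Text']]

The full, fixed code: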

from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import KFold
from sklearn.compose import make_column_transformer
from sklearn.model_selection import cross_val_score

categorical_cols = [col for col in sample_df.columns if col.startswith('Level')]
textual_col = ['Text']
pipeline = Pipeline([
    # pass 'Text' as a scalar string so CountVectorizer receives a 1D column;
    # the binary 'Level' columns are not transformed but passed through
    ('transformer', make_column_transformer((CountVectorizer(), 'Text'),
                                             remainder='passthrough')),
    ('model', RandomForestRegressor())
])
X = sample_df[textual_col + categorical_cols]
y = sample_df['Value']
cv = KFold(n_splits=5, shuffle=True)
scores = cross_val_score(pipeline, X, y, cv=cv)
scores
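
As a side note (not part of the original answer), the make_column_selector helper mentioned in the quoted documentation offers an alternative to remainder='passthrough': the binary columns can be selected explicitly by a name pattern and passed through unchanged. A minimal sketch, assuming the same sample_df:

from sklearn.compose import make_column_selector, make_column_transformer
from sklearn.feature_extraction.text import CountVectorizer

transformer = make_column_transformer(
    # scalar string: CountVectorizer receives the 1D 'Text' column
    (CountVectorizer(), 'Text'),
    # select the five binary 'Level*' columns by regex and pass them through unchanged
    ('passthrough', make_column_selector(pattern='^Level')),
)

This transformer can be dropped into the same Pipeline in place of the one above.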
