
Why does sklearn preprocessing LabelEncoder inverse_transform apply from only one column?

I have a random forest model built with sklearn. The model is built in one file, and I have a second file where I use joblib to load the model and apply it to new data. The data has categorical fields that are converted via sklearn's preprocessing LabelEncoder.fit_transform. Once the prediction is made, I am attempting to reverse this conversion with LabelEncoder.inverse_transform.

Here is the code:

 #assumed setup (not shown in the question): pandas and a single LabelEncoder instance
 import pandas as pd
 from sklearn.preprocessing import LabelEncoder
 le = LabelEncoder()

 #transform the categorical rf inputs
 df["method"] = le.fit_transform(df["method"])
 df["vendor"] = le.fit_transform(df["vendor"])
 df["type"] = le.fit_transform(df["type"])
 df["name"] = le.fit_transform(df["name"])
 df["address"] = le.fit_transform(df["address"])

 #designate inputs for rf model
 inputs = ["amt","vendor","type","name","address","method"]

 #load rf model and run it on new data
 from sklearn.externals import joblib  #deprecated in newer scikit-learn; use "import joblib" there
 rf = joblib.load('rf.pkl')
 predict = rf.predict(df[inputs])

 #reverse LabelEncoder fit_transform
 df["method"] = le.inverse_transform(df["method"])
 df["vendor"] = le.inverse_transform(df["vendor"])
 df["type"] = le.inverse_transform(df["type"])
 df["name"] = le.inverse_transform(df["name"])
 df["address"] = le.inverse_transform(df["address"])

 #convert target to numeric to make it play nice with SQL Server
 predict = pd.to_numeric(predict)

 #add target field to df
 df["prediction"] = predict

 #write results to SQL Server table
 import sqlalchemy
 engine = sqlalchemy.create_engine("mssql+pyodbc://<username>:<password>@UserDSN")
 df.to_sql('TABLE_NAME', engine, schema='SCHEMANAME', if_exists='replace', index=False)

Without the inverse_transform piece, the results are as expected: numeric codes in place of categorical values. With the inverse_transform piece, the results are odd: the categorical values corresponding to the "address" field are returned for all categorical fields.

So if 1600 Pennsylvania Avenue is encoded as the number 1, all categorical values encoded as the number 1 (regardless of field) now return 1600 Pennsylvania Avenue. Why is inverse_transform picking one column from which to reverse all fit_transform codes?

This is the expected behaviour.

When you call le.fit_transform(), the internal parameters (the classes learned) of the LabelEncoder are re-initialised. The le object is fitted to the values of the column you supplied.

In the above code, you are using the same object to transform all columns, and the last column you supplied is address. Hence le forgets everything it learned from previous calls to fit() (or fit_transform() in this case) and learns the new data instead. So when you call inverse_transform() on it, it only returns values related to address. Hope I'm clear.
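A quick way to see this in isolation (a toy example with made-up values, not taken from the question's data):

 from sklearn.preprocessing import LabelEncoder

 le = LabelEncoder()
 le.fit_transform(["cash", "credit"])                        #fit on a "method"-like column
 print(le.classes_)                                          #['cash' 'credit']

 le.fit_transform(["123 Main St", "1600 Pennsylvania Ave"])  #refit on an "address"-like column
 print(le.classes_)                                          #['123 Main St' '1600 Pennsylvania Ave']

 #the classes learned from the first call are gone, so inverse_transform
 #now maps every integer code back to an address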

To encode all columns, you need to initialise different objects, one for each column. Something like below:

 df["method"] = le_method.fit_transform(df["method"])
 df["vendor"] = le_vendor.fit_transform(df["vendor"])
 df["type"] = le_type.fit_transform(df["type"])
 df["name"] = le_name.fit_transform(df["name"])
 df["address"] = le_address.fit_transform(df["address"])

and then call inverse_transform() on the appropriate encoder.
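For example, a minimal sketch (assuming df holds the same columns as in the question) that keeps one encoder per column in a dict, so each column can be decoded with the encoder that was fitted on it:

 from sklearn.preprocessing import LabelEncoder

 categorical = ["method", "vendor", "type", "name", "address"]

 #one LabelEncoder per categorical column
 encoders = {col: LabelEncoder() for col in categorical}

 #encode each column with its own encoder
 for col in categorical:
     df[col] = encoders[col].fit_transform(df[col])

 #...load the model and predict on the encoded data...

 #decode each column with the encoder that was fitted on it
 for col in categorical:
     df[col] = encoders[col].inverse_transform(df[col])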

I know this is an old question, however for everyone who likes convenience:

apply, coupled with lambda, can transform multiple/all columns with ease:

df = df.apply(lambda col: le.fit_transform(col))
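Note that this one-liner still reuses a single le, so it has the same inverse_transform problem described above. A variant (a sketch, assuming every column of df is categorical) that keeps one encoder per column by using collections.defaultdict:

 from collections import defaultdict
 from sklearn.preprocessing import LabelEncoder

 #a fresh LabelEncoder is created the first time each column name is looked up
 encoders = defaultdict(LabelEncoder)

 #fit/transform every column with its own encoder
 encoded = df.apply(lambda col: encoders[col.name].fit_transform(col))

 #reverse the encoding column by column
 decoded = encoded.apply(lambda col: encoders[col.name].inverse_transform(col))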

I despise non-aliased, non-dynamic code (you should too) like the following, unless it is really necessary:

 df["method"] = le_method.fit_transform(df["method"])
 df["vendor"] = le_vendor.fit_transform(df["vendor"])
 df["type"] = le_type.fit_transform(df["type"])
 df["name"] = le_name.fit_transform(df["name"])
 df["address"] = le_address.fit_transform(df["address"])

