Converting a pandas DataFrame with categorical values into binary values
I am trying to convert categorical data into binary form so that I can classify it with an algorithm like logistic regression. I thought of using OneHotEncoder from the sklearn.preprocessing module, but the problem is that the dataframe entries are pairs of arrays A and B: within each row the two arrays have the same length, but the length varies from row to row. OneHotEncoder does not accept a dataframe like mine:
In [34]: data.index
Out[34]: Index([train1, train2, train3, ..., train7829, train7830, train7831], dtype=object)
In [35]: data.columns
Out[35]: Index([A, B], dtype=object)
SampleID A B
train1 [2092.0, 1143.0, 390.0, ...] [5651.0, 4449.0, 4012.0...]
train2 [3158.0, 3158.0, 3684.0, 3684.0....] [2.0, 4.0, 2.0, 1.0...]
train3 [1699.0, 1808.0 ,...] [0.0, 1.0...]
So, I want to highlight again that each A and B pair has the same length, but the length varies across pairs. The dataframe contains numerical, categorical and binary values. I have another csv file with information about the type of every entry. I read that file and filter out the categorical entries in both columns like this:
info = data_io.read_train_info()
col1 = info.columns[0]
col2 = info.columns[1]
info = info[(info[col1] == 'Categorical') & (info[col2] == 'Categorical')]
Then I use info.index to filter my training dataframe:
filtered = data.loc[info.index]
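As a minimal sketch of this filtering step (with a hypothetical stand-in for the metadata file, since `data_io.read_train_info()` is not shown), it behaves like this:

```python
import pandas as pd

# Hypothetical stand-in for the metadata returned by data_io.read_train_info():
# one row per SampleID, with the declared type of the A and B entries.
info = pd.DataFrame({
    'A_type': ['Categorical', 'Numerical', 'Categorical'],
    'B_type': ['Categorical', 'Categorical', 'Categorical'],
}, index=['train1', 'train2', 'train3'])

col1 = info.columns[0]
col2 = info.columns[1]
# Keep only the rows where both A and B are declared categorical
info = info[(info[col1] == 'Categorical') & (info[col2] == 'Categorical')]

print(info.index.tolist())
```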
Then I wrote a utility function to change the dimensions of each array so that I can encode them later:
def setDim(df):
    # reshape each 1-D array in both columns to a single-row 2-D array in place
    for item in df[df.columns[0]].index:
        df[df.columns[0]][item].shape = (1, df[df.columns[0]][item].shape[0])
        df[df.columns[1]][item].shape = (1, df[df.columns[1]][item].shape[0])

setDim(filtered)
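The in-place shape assignment used above turns a 1-D array into a single-row 2-D array; a standalone sketch on one invented cell value:

```python
import numpy as np

arr = np.array([2092.0, 1143.0, 390.0])  # a 1-D array, as stored in one cell
arr.shape = (1, arr.shape[0])            # in-place reshape: one row, same data

print(arr.shape)
```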
Then I thought of combining each pair of arrays into a 2-row matrix, so that I can pass it to the encoder and separate the rows again after encoding, like this:
import numpy as np
from sklearn.preprocessing import OneHotEncoder

def makeSparse(df):
    enc = OneHotEncoder()
    for i in df.index:
        # stack the A and B rows into one 2-row matrix, encode, then split back
        cd = np.append(df['A'][i], df['B'][i], axis=0)
        a = enc.fit_transform(cd)
        df['A'][i] = a[0, :]
        df['B'][i] = a[1, :]

makeSparse(filtered)
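To make the stack-encode-split idea concrete, here is a self-contained sketch on a single hypothetical pair (the values are invented, and a recent scikit-learn is assumed; OneHotEncoder treats each column of the 2-row matrix as one feature):

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# One hypothetical A/B pair, already reshaped to single-row 2-D arrays
a = np.array([[2092.0, 1143.0, 390.0]])
b = np.array([[5651.0, 449.0, 4012.0]])

# Stack into a 2-row matrix so both arrays are encoded with one fit
cd = np.append(a, b, axis=0)

enc = OneHotEncoder()
encoded = enc.fit_transform(cd)  # sparse matrix, one row per original array

row_a = encoded[0, :]            # encoded A
row_b = encoded[1, :]            # encoded B
```

Each of the three columns holds two distinct values here, so the encoded matrix has 2 rows and 6 one-hot columns.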
After all these steps I get a sparse dataframe. My questions are:
This is a nice way to transform your data into a better representation to work with; it uses some neat apply tricks:
In [72]: df
Out[72]:
A B
train1 [2092, 1143, 390] [5651, 449, 4012]
train2 [3158, 3158, 3684, 3684] [2, 4, 2, 1]
train3 [1699, 1808] [0, 1]
In [73]: concat(dict([ (x[0],x[1].apply(lambda y: Series(y))) for x in df.iterrows() ]))
Out[73]:
0 1 2 3
train1 A 2092 1143 390 NaN
B 5651 449 4012 NaN
train2 A 3158 3158 3684 3684
B 2 4 2 1
train3 A 1699 1808 NaN NaN
B 0 1 NaN NaN
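The console session above can be reproduced with a short self-contained script (using the same toy values; note that in recent pandas versions, `Series.apply` returning a DataFrame is deprecated, though it still works with a warning):

```python
import pandas as pd

# Toy frame with the question's structure: each cell holds a list, and
# list lengths vary across rows but match within a row.
df = pd.DataFrame({
    'A': [[2092, 1143, 390], [3158, 3158, 3684, 3684], [1699, 1808]],
    'B': [[5651, 449, 4012], [2, 4, 2, 1], [0, 1]],
}, index=['train1', 'train2', 'train3'])

# Expand each row's two lists into aligned columns, then stack the per-row
# frames under a (SampleID, column) MultiIndex; shorter rows pad with NaN.
wide = pd.concat({idx: row.apply(pd.Series) for idx, row in df.iterrows()})

print(wide)
```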