
Converting a pandas DataFrame with categorical values into binary values

I am trying to convert categorical data into binary form so that I can classify it with an algorithm such as logistic regression. I thought of using OneHotEncoder from the sklearn.preprocessing module, but the problem is that the dataframe entries are pairs of arrays (A, B): the two arrays in a row have the same length, but that length differs from row to row. OneHotEncoder does not accept a dataframe like mine.

In [34]: data.index

Out[34]: Index([train1, train2, train3, ..., train7829, train7830, train7831], dtype=object)

In [35]:  data.columns

Out[35]:  Index([A, B], dtype=object)

SampleID                      A                                B
train1      [2092.0, 1143.0, 390.0, ...]          [5651.0, 4449.0, 4012.0...]
train2      [3158.0, 3158.0, 3684.0, 3684.0....]  [2.0, 4.0, 2.0, 1.0...]
train3      [1699.0, 1808.0 ,...]                 [0.0, 1.0...]

To stress the point again: the A and B arrays in each pair have the same length, but that length varies from pair to pair. The dataframe contains numerical, categorical and binary values. I have another CSV file with information about the type of every entry. I read that file and keep only the rows where both columns are categorical, like this:

info=data_io.read_train_info()
col1=info.columns[0]
col2=info.columns[1]
info=info[(info[col1]=='Categorical')&(info[col2]=='Categorical')]

Then I use info.index to filter my training dataframe:

filtered = data.loc[info.index]

Then I wrote a utility function to change the dimensions of each array so that I can encode them later:

def setDim(df):
    # Reshape every array in the first two columns from (n,) to (1, n), in place.
    for item in df.index:
        df[df.columns[0]][item].shape = (1, df[df.columns[0]][item].shape[0])
        df[df.columns[1]][item].shape = (1, df[df.columns[1]][item].shape[0])

setDim(filtered)

Then I thought of combining each pair of arrays into a 2-row matrix so that I can pass it to the encoder, and then separating them again after encoding, like this:

import numpy as np
from sklearn.preprocessing import OneHotEncoder

def makeSparse(df):
    enc = OneHotEncoder()
    for i in df.index:
        # Stack the (1, n) arrays for A and B into a single (2, n) matrix.
        cd = np.append(df['A'][i], df['B'][i], axis=0)
        a = enc.fit_transform(cd)
        df['A'][i] = a[0, :]
        df['B'][i] = a[1, :]

makeSparse(filtered)
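For reference, here is a small sketch of what OneHotEncoder does with a stacked pair like the one built above. It assumes a recent scikit-learn release (where categories are inferred per column), and the two arrays are toy values taken from the train2 row; the variable names are illustrative only. The encoder treats rows as samples and columns as features, so a 2-row matrix is read as 2 samples with n features, not as 2 variables with n observations each.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Toy pair copied from the train2 row of the sample table.
a = np.array([3158.0, 3158.0, 3684.0, 3684.0])
b = np.array([2.0, 4.0, 2.0, 1.0])

# Stacking gives shape (2, 4): the encoder sees 2 samples with 4 features.
stacked = np.vstack([a, b])
encoded = OneHotEncoder().fit_transform(stacked)
print(encoded.shape)  # (2, 8): each of the 4 columns has 2 distinct values

# Encoding the values of one variable instead: one observation per row.
per_value = OneHotEncoder().fit_transform(a.reshape(-1, 1))
print(per_value.shape)  # (4, 2): 4 observations, 2 distinct categories
```

With the stacked layout, the number of output columns depends on how often the A and B values differ at each position, which is probably not the encoding that was intended.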

After all these steps I get a sparse dataframe. My questions are:

  1. Is this the right way to encode this dataframe? (I highly doubt it.)
  2. If not, what alternatives would you suggest?

Thanks a lot for taking the time to help me.

This is a nice way to transform your data into a representation that is easier to work with; it uses some neat apply tricks (concat and Series here are pandas.concat and pandas.Series):

In [72]: df
Out[72]: 
                               A                  B
train1         [2092, 1143, 390]  [5651, 449, 4012]
train2  [3158, 3158, 3684, 3684]       [2, 4, 2, 1]
train3              [1699, 1808]             [0, 1]

In [73]: concat(dict([ (x[0],x[1].apply(lambda y: Series(y))) for x in df.iterrows() ]))
Out[73]: 
             0     1     2     3
train1 A  2092  1143   390   NaN
       B  5651   449  4012   NaN
train2 A  3158  3158  3684  3684
       B     2     4     2     1
train3 A  1699  1808   NaN   NaN
       B     0     1   NaN   NaN
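A self-contained variant of the same reshaping, as a sketch with explicit imports. The toy frame copies the three rows shown above. Building each per-sample block with pd.DataFrame(row.tolist(), ...) instead of apply, and the final get_dummies step on a single variable, are assumptions of this sketch and not part of the original answer.

```python
import pandas as pd

# Toy data copied from the example frame above.
df = pd.DataFrame(
    {
        "A": [[2092, 1143, 390], [3158, 3158, 3684, 3684], [1699, 1808]],
        "B": [[5651, 449, 4012], [2, 4, 2, 1], [0, 1]],
    },
    index=["train1", "train2", "train3"],
)

# One 2 x n block per sample (rows A and B), stacked under a
# (sample, column) MultiIndex; shorter samples are padded with NaN.
wide = pd.concat(
    {name: pd.DataFrame(row.tolist(), index=row.index) for name, row in df.iterrows()}
)
print(wide)

# Indicator (one-hot) columns for a single variable of a single sample:
# drop the NaN padding first, then encode the remaining values.
train2_a = wide.loc[("train2", "A")].dropna()
dummies = pd.get_dummies(train2_a)
print(dummies)
```

Whether the categories should be encoded per variable, as here, or shared across samples depends on what the values mean, which the question does not say.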
