Converting a pandas DataFrame with categorical values to binary values

Question

I am trying to convert categorical data into binary so that I can classify it with an algorithm like logistic regression. I thought of using OneHotEncoder from the 'sklearn.preprocessing' module, but the problem is that the DataFrame entries are pairs of arrays (columns A and B): within a row both arrays have the same length, but that length differs from row to row. OneHotEncoder does not accept a DataFrame like mine:

In [34]: data.index

Out[34]: Index([train1, train2, train3, ..., train7829, train7830, train7831], dtype=object)

In [35]:  data.columns

Out[35]:  Index([A, B], dtype=object)

SampleID                      A                                B
train1      [2092.0, 1143.0, 390.0, ...]          [5651.0, 4449.0, 4012.0...]
train2      [3158.0, 3158.0, 3684.0, 3684.0....]  [2.0, 4.0, 2.0, 1.0...]
train3      [1699.0, 1808.0 ,...]                 [0.0, 1.0...]

To repeat: each A and B pair has the same length, but the length varies between pairs. The DataFrame contains numerical, categorical and binary values. I have another CSV file with information about the type of every entry. I read that file and keep only the entries that are categorical in both columns, like this:

info=data_io.read_train_info()
col1=info.columns[0]
col2=info.columns[1]
info=info[(info[col1]=='Categorical')&(info[col2]=='Categorical')]

Then I use info.index to filter my training DataFrame:

filtered = data.loc[info.index]
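
For illustration, the filtering step can be sketched on toy data. The contents of the type-information file are not shown in the question, so the `info` frame below (its column names and values) is a hypothetical stand-in, assuming one row per sample and one type label per data column.

```python
import pandas as pd

# Hypothetical stand-in for the type-information file: one row per sample,
# one column per data column, each holding the type of that entry.
info = pd.DataFrame(
    {"A type": ["Categorical", "Numerical", "Categorical"],
     "B type": ["Categorical", "Categorical", "Binary"]},
    index=["train1", "train2", "train3"],
)

# Toy training frame sharing the same index.
data = pd.DataFrame(
    {"A": [[2092.0, 1143.0], [3158.0, 3684.0], [1699.0, 1808.0]],
     "B": [[5651.0, 4449.0], [2.0, 4.0], [0.0, 1.0]]},
    index=["train1", "train2", "train3"],
)

col1, col2 = info.columns[:2]

# Boolean mask: True only where both entries of the pair are categorical.
both_categorical = (info[col1] == "Categorical") & (info[col2] == "Categorical")

# Select the matching rows of the training frame by label.
filtered = data.loc[info.index[both_categorical]]
print(filtered)
```

Only `train1` survives here, because it is the only row labelled categorical in both columns.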

Then I wrote a utility function to change the dimensions of each array so that I can encode them later:

def setDim(df):
    # Reshape every array in the first two columns in place, from (n,) to (1, n)
    for item in df.index:
        df[df.columns[0]][item].shape = (1, df[df.columns[0]][item].shape[0])
        df[df.columns[1]][item].shape = (1, df[df.columns[1]][item].shape[0])

setDim(filtered)

Then I thought I would combine each pair of arrays into a 2-row matrix, pass it to the encoder, and separate the rows again after encoding, like this:

import numpy as np
from sklearn.preprocessing import OneHotEncoder

def makeSparse(df):
    enc = OneHotEncoder()
    for i in df.index:
        # Stack the two (1, n) arrays into one (2, n) matrix and encode it
        cd = np.append(df['A'][i], df['B'][i], axis=0)
        a = enc.fit_transform(cd)
        df['A'][i] = a[0, :]
        df['B'][i] = a[1, :]

makeSparse(filtered)
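
The reshape-and-stack step on its own can be sketched with plain NumPy, leaving the encoder out. The two arrays below are taken from the `train2` row of the sample table; the variable names are illustrative only.

```python
import numpy as np

a = np.array([3158.0, 3158.0, 3684.0, 3684.0])
b = np.array([2.0, 4.0, 2.0, 1.0])

# Explicit reshape from (n,) to (1, n), which is what setDim does in place.
a_row = a.reshape(1, -1)
b_row = b.reshape(1, -1)

# Appending along axis 0 yields a (2, n) matrix: row 0 is A, row 1 is B.
pair = np.append(a_row, b_row, axis=0)

# np.vstack accepts the 1-D arrays directly, so no prior reshape is needed.
pair_alt = np.vstack([a, b])

print(pair.shape)
```

Both routes give the same matrix; `np.vstack` simply skips the separate reshaping pass.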

After all these steps I get a sparse DataFrame. My questions are:

Is this the right way to encode this DataFrame? (I highly doubt it.)
If not, what alternatives would you suggest?
Thanks a lot for taking the time to help me.

Answer 1

This is a nice way to transform your data into a representation that is easier to work with; it uses some neat apply tricks:

In [72]: df
Out[72]: 
                               A                  B
train1         [2092, 1143, 390]  [5651, 449, 4012]
train2  [3158, 3158, 3684, 3684]       [2, 4, 2, 1]
train3              [1699, 1808]             [0, 1]

In [73]: concat(dict([ (x[0],x[1].apply(lambda y: Series(y))) for x in df.iterrows() ]))
Out[73]: 
             0     1     2     3
train1 A  2092  1143   390   NaN
       B  5651   449  4012   NaN
train2 A  3158  3158  3684  3684
       B     2     4     2     1
train3 A  1699  1808   NaN   NaN
       B     0     1   NaN   NaN
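
As a self-contained sketch, the same reshaping can be written with explicit `pandas` names, followed by one possible way to get binary indicator columns from it. The `pd.get_dummies` step is not part of the session above; it is an assumption about how one might continue, and it treats the values of A and B as one shared set of categories.

```python
import pandas as pd

df = pd.DataFrame(
    {"A": [[2092, 1143, 390], [3158, 3158, 3684, 3684], [1699, 1808]],
     "B": [[5651, 449, 4012], [2, 4, 2, 1], [0, 1]]},
    index=["train1", "train2", "train3"],
)

# One small frame per sample (rows A and B, columns = array positions),
# concatenated under the sample label. Shorter arrays are padded with NaN.
wide = pd.concat(
    {idx: pd.DataFrame({col: pd.Series(vals) for col, vals in row.items()}).T
     for idx, row in df.iterrows()}
)
print(wide)

# Long format without the NaN padding: one row per array element.
records = [
    (idx, col, pos, val)
    for idx, row in df.iterrows()
    for col, vals in row.items()
    for pos, val in enumerate(vals)
]
long = pd.DataFrame(
    records, columns=["sample", "column", "position", "value"]
).set_index(["sample", "column", "position"])

# Assumed continuation: one indicator column per distinct category value.
one_hot = pd.get_dummies(long["value"], dtype=int)
print(one_hot.shape)
```

Each row of `one_hot` describes a single array element, so it contains exactly one 1. How to aggregate those rows back to one feature vector per sample depends on the model and is left open here.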

Solution 1
1 2013-06-27 18:21:01
