
Encoding Multiple Categorical Data with Python using sklearn.preprocessing.LabelEncoder() takes too much processing time on 2D array inputs

Consider that, for some reason, I am trying to encode a feature. Let's say my feature's name is title. For the title feature, one record might contain several different words: title = 'Apple', 'Jobs'. Let me illustrate:

ID      title  
0  ['Apple', 'Jobs']  
1  ['Wozniak']
2  ['Apple', 'Wozniak']
3  ['Jobs', 'Wozniak']  

As you can see, my unique values are:

unique = ['Apple','Jobs','Wozniak']

Previously, I was using the label encoder like this:

from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
le.fit(unique)
# encode each record's word list one row at a time (slow when there are many rows)
for i in df['title'].index:
    df['title'][i] = le.transform(df['title'][i])

And I used to get something like:

ID      title  
0  [782, 256]  
1  [331]
2  [782, 331]
3  [256, 331] 

which was exactly what I wanted; yet, this takes too much time because I have too many values to iterate over and encode. Thus, I am looking for a smarter algorithm, preferably one with lower time complexity or a shorter running time.

Later, I discovered that first partitioning the elements of title into up to 5 columns and then applying label encoding solves my problem. I am sharing a sample of the solution:

def process(self):
    # Split every multi-valued column in the configured range into
    # fixed-width slot columns and merge them back into the DataFrame.
    for i in range(self._MULHOTBEGIN, self._MULHOTEND):
        print("\tWorking on", self.df.columns[i])
        self._colnames.append(self.df.columns[i])
        self.df = pd.merge(self.df, self._processSubRoutine(i),
                           left_index=True, right_index=True)
    print("\tCollecting garbage.")
    self._collectGarbage()

def _processSubRoutine(self, colindex):
    # Truncate each record's list to _MAXLEN items and pad shorter lists
    # with the placeholder '0', so every record yields the same columns.
    result = list()
    for i in range(self._len):
        truncated = self.df.iloc[i, colindex][:self._MAXLEN]
        padded = list(truncated) + ['0'] * (self._MAXLEN - len(truncated))
        result.append(padded)
    colnames = self._createColNames(colindex)
    print("\t", colnames)
    return pd.DataFrame(result, columns=colnames, index=self.df.index)
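
The two methods above rely on class internals that are not shown here (self._MULHOTBEGIN, self._MAXLEN, self._createColNames(), and so on). A minimal, standalone sketch of the same padding/partitioning step, with hypothetical names (pad_column, max_len, and the '0' pad token are placeholders of my own), could look like this:

import pandas as pd

def pad_column(df, col, max_len, pad='0'):
    # Truncate each record's list to max_len items and pad shorter lists
    # with the placeholder token, yielding one column per slot
    # (e.g. title_0 ... title_4 for max_len=5).
    rows = []
    for values in df[col]:
        truncated = list(values)[:max_len]
        rows.append(truncated + [pad] * (max_len - len(truncated)))
    colnames = ['{}_{}'.format(col, i) for i in range(max_len)]
    return pd.DataFrame(rows, columns=colnames, index=df.index)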

These methods are called inside the preprocessing file as follows:

M = MulHotProcessor(reorderedSource,mulhotBeginIndex2, mulhotEndIndex2, max_len_per_slot_2)
M.process()
sourceProc = M.getDataFrame()

entvals = pd.concat([sourceProc['title_0'],sourceProc['title_1'], \
              sourceProc['title_2'],sourceProc['title_3'], \
              sourceProc['title_4']]).unique()
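
If the padded columns follow the title_<i> naming pattern as above, the same unique values can be collected with a comprehension instead of listing every column by hand (assuming 5 slots here):

entvals = pd.concat([sourceProc['title_{}'.format(i)] for i in range(5)]).unique()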

Later on, we finally apply label encoding by fitting on the unique values:

le = LabelEncoder()
le.fit(entvals)  # or whatever you named your unique values

sourceProc['title_0'] = le.transform(sourceProc['title_0'])
sourceProc['title_1'] = le.transform(sourceProc['title_1'])
sourceProc['title_2'] = le.transform(sourceProc['title_2'])
...

At the end, you will have prepared, in a very short time, the transformation from the first DataFrame to the second DataFrame shown in the question.
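
For illustration only, here is how the whole pipeline might look on the toy DataFrame from the question, using the hypothetical pad_column helper sketched above with max_len=2. Note that the '0' pad token also receives its own code, so the resulting integers will not match the hand-written numbers earlier in the question:

from sklearn.preprocessing import LabelEncoder
import pandas as pd

df = pd.DataFrame({'title': [['Apple', 'Jobs'], ['Wozniak'],
                             ['Apple', 'Wozniak'], ['Jobs', 'Wozniak']]})

padded = pad_column(df, 'title', max_len=2)   # columns title_0, title_1
entvals = pd.concat([padded[c] for c in padded.columns]).unique()

le = LabelEncoder()
le.fit(entvals)                               # one fit over all slot columns
for c in padded.columns:
    padded[c] = le.transform(padded[c])       # column-wise, no per-row loop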
