
Encoding Multiple Categorical Data with Python using sklearn.preprocessing.LabelEncoder() takes too much processing time on 2D array inputs

Consider that, for some reason, I am trying to encode a feature. Let's say my feature's name is title. For the title feature, one record might contain several different words: title = 'Apple', 'Jobs'. Let me illustrate:

ID      title  
0  ['Apple', 'Jobs']  
1  ['Wozniak']
2  ['Apple', 'Wozniak']
3  ['Jobs', 'Wozniak']  

As you can see, my unique values are:

unique = ['Apple','Jobs','Wozniak']

Previously, I was using the label encoder like this:

from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
le.fit(unique)
# encode each record's word list one row at a time (slow when there are many rows)
for i in df['title'].index:
    df['title'][i] = le.transform(df['title'][i])

And I used to get something like:

ID      title  
0  [782, 256]  
1  [331]
2  [782, 331]
3  [256, 331] 

which was exactly what I wanted; yet, this takes too much time because I have too many values to iterate over and encode. Thus, I am looking for a smarter algorithm, preferably one with lower time complexity or a shorter running time.

Later, I discovered that first partitioning the elements of title into up to 5 columns and then applying label encoding solves my problem. I am sharing a sample of the solution:

def process(self):
    # Split every multi-valued column in the configured range into
    # fixed-width slot columns and merge them back into the DataFrame.
    for i in range(self._MULHOTBEGIN, self._MULHOTEND):
        print("\tWorking on", self.df.columns[i])
        self._colnames.append(self.df.columns[i])
        self.df = pd.merge(self.df, self._processSubRoutine(i),
                           left_index=True, right_index=True)
    print("\tCollecting garbage.")
    self._collectGarbage()

def _processSubRoutine(self, colindex):
    # Truncate each record's list to _MAXLEN items and pad shorter lists
    # with the placeholder '0', so every record yields the same columns.
    result = list()
    for i in range(self._len):
        truncated = self.df.iloc[i, colindex][:self._MAXLEN]
        padded = list(truncated) + ['0'] * (self._MAXLEN - len(truncated))
        result.append(padded)
    colnames = self._createColNames(colindex)
    print("\t", colnames)
    return pd.DataFrame(result, columns=colnames, index=self.df.index)
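
The two methods above rely on class internals that are not shown here (self._MULHOTBEGIN, self._MAXLEN, self._createColNames(), and so on). A minimal, standalone sketch of the same padding/partitioning step, with hypothetical names (pad_column, max_len, and the '0' pad token are placeholders of my own), could look like this:

import pandas as pd

def pad_column(df, col, max_len, pad='0'):
    # Truncate each record's list to max_len items and pad shorter lists
    # with the placeholder token, yielding one column per slot
    # (e.g. title_0 ... title_4 for max_len=5).
    rows = []
    for values in df[col]:
        truncated = list(values)[:max_len]
        rows.append(truncated + [pad] * (max_len - len(truncated)))
    colnames = ['{}_{}'.format(col, i) for i in range(max_len)]
    return pd.DataFrame(rows, columns=colnames, index=df.index)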

These methods are called inside the preprocessing file as follows:

M = MulHotProcessor(reorderedSource,mulhotBeginIndex2, mulhotEndIndex2, max_len_per_slot_2)
M.process()
sourceProc = M.getDataFrame()

entvals = pd.concat([sourceProc['title_0'],sourceProc['title_1'], \
              sourceProc['title_2'],sourceProc['title_3'], \
              sourceProc['title_4']]).unique()
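
If the padded columns follow the title_<i> naming pattern as above, the same unique values can be collected with a comprehension instead of listing every column by hand (assuming 5 slots here):

entvals = pd.concat([sourceProc['title_{}'.format(i)] for i in range(5)]).unique()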

Later on, we finally apply label encoding by fitting on the unique values:

le = LabelEncoder()
le.fit(entvals)  # or whatever you named your unique values

sourceProc['title_0'] = le.transform(sourceProc['title_0'])
sourceProc['title_1'] = le.transform(sourceProc['title_1'])
sourceProc['title_2'] = le.transform(sourceProc['title_2'])
...

At the end, you will have prepared, in a very short time, the transformation from the first DataFrame to the second DataFrame shown in the question.
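
For illustration only, here is how the whole pipeline might look on the toy DataFrame from the question, using the hypothetical pad_column helper sketched above with max_len=2. Note that the '0' pad token also receives its own code, so the resulting integers will not match the hand-written numbers earlier in the question:

from sklearn.preprocessing import LabelEncoder
import pandas as pd

df = pd.DataFrame({'title': [['Apple', 'Jobs'], ['Wozniak'],
                             ['Apple', 'Wozniak'], ['Jobs', 'Wozniak']]})

padded = pad_column(df, 'title', max_len=2)   # columns title_0, title_1
entvals = pd.concat([padded[c] for c in padded.columns]).unique()

le = LabelEncoder()
le.fit(entvals)                               # one fit over all slot columns
for c in padded.columns:
    padded[c] = le.transform(padded[c])       # column-wise, no per-row loop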
