Encoding Multiple Categorical Data with Python using sklearn.preprocessing.LabelEncoder() takes too much processing time on 2D array inputs
Suppose that, for some reason, I am trying to encode a feature. Let's say my feature name is title. For the title feature, a single record might contain several words, e.g. title = ['Apple', 'Jobs']. Let me illustrate:
ID title
0 ['Apple', 'Jobs']
1 ['Wozniak']
2 ['Apple', 'Wozniak']
3 ['Jobs', 'Wozniak']
As you can see, my unique values are:
unique = ['Apple','Jobs','Wozniak']
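For reference, a minimal runnable sketch of this setup (the DataFrame and the column name follow the example above; deriving unique with a set comprehension is my own illustration, not code from the question):

```python
import pandas as pd

# the example DataFrame from the question: one list of words per record
df = pd.DataFrame({'title': [['Apple', 'Jobs'],
                             ['Wozniak'],
                             ['Apple', 'Wozniak'],
                             ['Jobs', 'Wozniak']]})

# flatten the per-record lists and keep each token once
unique = sorted({word for row in df['title'] for word in row})
print(unique)  # ['Apple', 'Jobs', 'Wozniak']
```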
Previously, I was using the label encoder as:
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
le.fit(unique)
for i in df['title'].index:
    df['title'][i] = le.transform(df['title'][i])
And I used to get something like:
ID title
0 [782, 256]
1 [331]
2 [782, 331]
3 [256, 331]
which was exactly what I wanted; yet, this takes too much time because I have too many values to iterate over and encode. Thus, I am looking for a smarter algorithm, preferably with a lower time complexity or a smaller running time.
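One way to avoid calling le.transform once per row (a hypothetical alternative, not the partitioning approach described below) is to precompute a plain dict from each value to its integer code and map the lists with O(1) lookups in a single pass. Since LabelEncoder assigns codes in sorted order of the classes, a dict built over sorted(unique) reproduces the same codes:

```python
import pandas as pd

df = pd.DataFrame({'title': [['Apple', 'Jobs'],
                             ['Wozniak'],
                             ['Apple', 'Wozniak'],
                             ['Jobs', 'Wozniak']]})
unique = ['Apple', 'Jobs', 'Wozniak']

# same codes LabelEncoder.fit(unique) would assign (sorted class order)
code = {cls: i for i, cls in enumerate(sorted(unique))}

# one pass over the column instead of one transform() call per row
df['title'] = df['title'].map(lambda words: [code[w] for w in words])
print(df['title'].tolist())  # [[0, 1], [2], [0, 2], [1, 2]]
```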
Later, I discovered that first partitioning the elements of title into up to 5 columns and then applying label encoding solves my problem. I am sharing a sample of the solution:
def process(self):
    for i in range(self._MULHOTBEGIN, self._MULHOTEND):
        print("\tWorking on", self.df.columns[i])
        self._colnames.append(self.df.columns[i])
        # merge the padded columns back into the main frame on the index
        self.df = pd.merge(self.df, self._processSubRoutine(i),
                           left_index=True, right_index=True)
        print("\tCollecting garbage.")
        self._collectGarbage()

def _processSubRoutine(self, colindex):
    result = list()
    for i in range(self._len):
        # truncate each list to at most _MAXLEN entries...
        truncated = self.df.iloc[i, colindex][:self._MAXLEN]
        # ...and pad shorter lists with the placeholder '0'
        padded = list(truncated) + ['0'] * (self._MAXLEN - len(truncated))
        result.append(padded)
    colnames = self._createColNames(colindex)
    print("\t", colnames)
    return pd.DataFrame(result, columns=colnames, index=self.df.index)
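Since the class above depends on attributes such as self._MAXLEN and self._createColNames, here is a self-contained sketch of just the truncate-and-pad step (the function name pad_column, the default max_len of 5, and the column-naming scheme are my own illustrative assumptions):

```python
import pandas as pd

def pad_column(series, max_len=5, pad='0'):
    """Truncate each list to max_len items and pad shorter lists with `pad`."""
    rows = []
    for values in series:
        truncated = list(values)[:max_len]
        rows.append(truncated + [pad] * (max_len - len(truncated)))
    colnames = [f'{series.name}_{i}' for i in range(max_len)]
    return pd.DataFrame(rows, columns=colnames, index=series.index)

df = pd.DataFrame({'title': [['Apple', 'Jobs'], ['Wozniak']]})
padded = pad_column(df['title'])
print(padded)
```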
These processes are called inside the preprocessing file with:
M = MulHotProcessor(reorderedSource,mulhotBeginIndex2, mulhotEndIndex2, max_len_per_slot_2)
M.process()
sourceProc = M.getDataFrame()
entvals = pd.concat([sourceProc['title_0'],sourceProc['title_1'], \
sourceProc['title_2'],sourceProc['title_3'], \
sourceProc['title_4']]).unique()
Later on, we finally apply label encoding by fitting the unique values:
le.fit(<unique_values>) # whatever you name your unique values
sourceProc['title_0']=le.transform(sourceProc['title_0'])
sourceProc['title_1']=le.transform(sourceProc['title_1'])
sourceProc['title_2']=le.transform(sourceProc['title_2'])
...
At the end, you will have, in a very short time, prepared the transformation from the first DF to the second DF in the question.
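The repetitive per-column transforms above can be written as a loop over the padded columns. A minimal sketch (here with only two columns for brevity; in the full example the list would be title_0 through title_4, and note that the '0' padding token gets encoded like any other value):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

sourceProc = pd.DataFrame({
    'title_0': ['Apple', 'Wozniak'],
    'title_1': ['Jobs', '0'],
})

cols = ['title_0', 'title_1']  # in the full example: title_0 ... title_4

# fit once on the union of values across all padded columns
entvals = pd.concat([sourceProc[c] for c in cols]).unique()
le = LabelEncoder()
le.fit(entvals)

# then transform every column with the same encoder
for c in cols:
    sourceProc[c] = le.transform(sourceProc[c])
```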