
Python - Speed up for converting a categorical variable to its numerical index

I need to convert a column of categorical variables in a Pandas data frame into a numerical value that corresponds to the index of that value in an array of the unique categorical variables in the column (long story!). Here is a code snippet that accomplishes that:

import pandas as pd
import numpy as np

d = {'col': ["baked", "beans", "baked", "baked", "beans"]}
df = pd.DataFrame(data=d)
uniq_lab = np.unique(df['col'])

# Replace each label with its index in the sorted unique array, one label at a time
for lab in uniq_lab:
    df['col'].replace(lab, np.where(uniq_lab == lab)[0][0].astype(float), inplace=True)

which converts the data frame:

    col
 0  baked
 1  beans
 2  baked
 3  baked
 4  beans

into the data frame:

    col
 0  0.0
 1  1.0
 2  0.0
 3  0.0
 4  1.0

as desired. But my problem is that my dumb little for loop (the only way I've thought of to do this) is slow as molasses when I try to run similar code on big data files. I was just curious whether anyone had any thoughts on ways to do this more efficiently. Thanks in advance for any thoughts.
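As a point of comparison with the loop above, here is one hedged sketch of a vectorized alternative (not from the original post): build the label-to-index mapping once as a dict, then apply it in a single pass with Series.map instead of calling replace() once per unique label.

```python
import pandas as pd
import numpy as np

d = {'col': ["baked", "beans", "baked", "baked", "beans"]}
df = pd.DataFrame(data=d)

# Build the label -> index mapping once...
uniq_lab = np.unique(df['col'])
mapping = {lab: float(i) for i, lab in enumerate(uniq_lab)}

# ...then apply it in one vectorized pass.
df['col'] = df['col'].map(mapping)
print(df['col'].tolist())  # [0.0, 1.0, 0.0, 0.0, 1.0]
```

This keeps the same alphabetical ordering as np.unique, so the resulting codes match the loop's output.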

Use factorize:

df['col'] = pd.factorize(df.col)[0]
print (df)
   col
0    0
1    1
2    0
3    0
4    1
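One detail worth noting: factorize assigns codes in order of first appearance, not alphabetical order. If you need codes that match the sorted order produced by np.unique, factorize accepts a sort parameter. A small sketch:

```python
import pandas as pd

s = pd.Series(["beans", "baked", "baked", "beans"])

# Default: codes follow order of first appearance ("beans" seen first -> 0).
codes, labels = pd.factorize(s)
print(codes.tolist())          # [0, 1, 1, 0]

# sort=True: codes follow the sorted order of the unique labels,
# matching np.unique(..., return_inverse=True).
codes_sorted, labels_sorted = pd.factorize(s, sort=True)
print(codes_sorted.tolist())   # [1, 0, 0, 1]
print(list(labels_sorted))     # ['baked', 'beans']
```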

Docs

EDIT:

As Jeff mentioned in a comment, the best option is to convert the column to categorical, mainly because of lower memory usage:

df['col'] = df['col'].astype("category")

Timings:

Interestingly, on a large df pandas is faster than numpy. I can't believe it.

len(df)=500k:

In [29]: %timeit (a(df1))
100 loops, best of 3: 9.27 ms per loop

In [30]: %timeit (a1(df2))
100 loops, best of 3: 9.32 ms per loop

In [31]: %timeit (b(df3))
10 loops, best of 3: 24.6 ms per loop

In [32]: %timeit (b1(df4))
10 loops, best of 3: 24.6 ms per loop  

len(df)=5k:

In [38]: %timeit (a(df1))
1000 loops, best of 3: 274 µs per loop

In [39]: %timeit (a1(df2))
The slowest run took 6.71 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 273 µs per loop

In [40]: %timeit (b(df3))
The slowest run took 5.15 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 295 µs per loop

In [41]: %timeit (b1(df4))
1000 loops, best of 3: 294 µs per loop

len(df)=5:

In [46]: %timeit (a(df1))
1000 loops, best of 3: 206 µs per loop

In [47]: %timeit (a1(df2))
1000 loops, best of 3: 204 µs per loop

In [48]: %timeit (b(df3))
The slowest run took 6.30 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 164 µs per loop

In [49]: %timeit (b1(df4))
The slowest run took 6.44 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 164 µs per loop

Code for testing:

d = {'col': ["baked","beans","baked","baked","beans"]}
df = pd.DataFrame(data=d)
print (df)
df = pd.concat([df]*100000).reset_index(drop=True)
#test for 5k
#df = pd.concat([df]*1000).reset_index(drop=True)


df1,df2,df3, df4 = df.copy(),df.copy(),df.copy(),df.copy()

def a(df):
    df['col'] = pd.factorize(df.col)[0]
    return df

def a1(df):
    idx,_ = pd.factorize(df.col)
    df['col'] = idx
    return df

def b(df):
    df['col'] = np.unique(df['col'],return_inverse=True)[1]
    return df

def b1(df):
    _,idx = np.unique(df['col'],return_inverse=True)
    df['col'] = idx    
    return df

print (a(df1))    
print (a1(df2))   
print (b(df3))   
print (b1(df4))  

You can use np.unique's optional argument return_inverse to ID each string based on its uniqueness among the others, and set those IDs in the input dataframe, like so -

_,idx = np.unique(df['col'],return_inverse=True)
df['col'] = idx

Please note that the IDs correspond to the alphabetically sorted array of unique strings. If you need that unique array, you can replace _ with it, like so -

uniq_lab,idx = np.unique(df['col'],return_inverse=True)
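Keeping uniq_lab also lets you invert the encoding: fancy-indexing the unique array with the codes reconstructs the original column. A small sketch:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'col': ["baked", "beans", "baked", "baked", "beans"]})

uniq_lab, idx = np.unique(df['col'], return_inverse=True)
print(list(uniq_lab))  # ['baked', 'beans']
print(idx.tolist())    # [0, 1, 0, 0, 1]

# uniq_lab[idx] maps each code back to its original label.
restored = uniq_lab[idx]
print((restored == df['col'].values).all())  # True
```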

Sample run -

>>> d = {'col': ["baked","beans","baked","baked","beans"]}
>>> df = pd.DataFrame(data=d)
>>> df
     col
0  baked
1  beans
2  baked
3  baked
4  beans
>>> _,idx = np.unique(df['col'],return_inverse=True)
>>> df['col'] = idx
>>> df
   col
0    0
1    1
2    0
3    0
4    1
