Python - Speeding up the conversion of a categorical variable to its numerical index
I need to convert a column of categorical variables in a Pandas data frame into numerical values, where each value is the index into an array of the unique categories in that column (long story!). Here's a code snippet that accomplishes that:
import pandas as pd
import numpy as np
d = {'col': ["baked","beans","baked","baked","beans"]}
df = pd.DataFrame(data=d)
uniq_lab = np.unique(df['col'])
for lab in uniq_lab:
    df['col'].replace(lab,np.where(uniq_lab == lab)[0][0].astype(float),inplace=True)
which converts the data frame:
col
0 baked
1 beans
2 baked
3 baked
4 beans
into the data frame:
col
0 0.0
1 1.0
2 0.0
3 0.0
4 1.0
as desired. My problem is that my dumb little for loop (the only way I've thought of to do this) is slow as molasses when I run similar code on big data files. I was curious whether anyone had thoughts on ways to do this more efficiently. Thanks in advance for any thoughts.
df['col'] = pd.factorize(df.col)[0]
print (df)
col
0 0
1 1
2 0
3 0
4 1
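As a small sketch of what pd.factorize hands back: it returns both the integer codes and the array of unique labels, so the labels can be recovered later. The column name col_idx and the variable restored below are illustrative names only, and the cast to float is needed only if the float output of the original loop matters.

```python
import pandas as pd

df = pd.DataFrame({"col": ["baked", "beans", "baked", "baked", "beans"]})

# factorize returns the integer codes and the unique labels they index into
codes, uniques = pd.factorize(df["col"])

# Cast to float only to mirror the float values produced by the original loop
df["col_idx"] = codes.astype(float)

# Taking the unique labels at each code position recovers the original strings
restored = uniques.take(codes)

print(df)
print(list(uniques))
```

Keeping uniques around is the design choice here: the codes alone cannot be mapped back to the strings.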
EDIT:
As Jeff mentioned in a comment, the best approach is to convert the column to categorical, mainly because it uses less memory:
df['col'] = df['col'].astype("category")
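If the integer index is still needed after the conversion, a categorical column exposes it through the .cat accessor. The following is a minimal sketch, assuming the same sample data as above; the names cat, big, plain_bytes and category_bytes are illustrative. It also compares memory use on a repeated copy of the data, without asserting any particular byte counts.

```python
import pandas as pd

df = pd.DataFrame({"col": ["baked", "beans", "baked", "baked", "beans"]})

cat = df["col"].astype("category")

# Integer code per row; these index into cat.cat.categories
codes = cat.cat.codes
labels = cat.cat.categories

# Compare memory on a larger copy of the same column (index excluded)
big = pd.concat([df] * 1000, ignore_index=True)["col"]
plain_bytes = big.memory_usage(index=False, deep=True)
category_bytes = big.astype("category").memory_usage(index=False, deep=True)

print(codes.tolist(), list(labels))
print(plain_bytes, category_bytes)
```

For string categories like these, the categories are stored sorted, so the codes match the alphabetical indices the question asks for.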
Timings:
Interestingly, on a large df pandas is faster than numpy. I can't believe it.
len(df)=500k:
In [29]: %timeit (a(df1))
100 loops, best of 3: 9.27 ms per loop
In [30]: %timeit (a1(df2))
100 loops, best of 3: 9.32 ms per loop
In [31]: %timeit (b(df3))
10 loops, best of 3: 24.6 ms per loop
In [32]: %timeit (b1(df4))
10 loops, best of 3: 24.6 ms per loop
len(df)=5k:
In [38]: %timeit (a(df1))
1000 loops, best of 3: 274 µs per loop
In [39]: %timeit (a1(df2))
The slowest run took 6.71 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 273 µs per loop
In [40]: %timeit (b(df3))
The slowest run took 5.15 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 295 µs per loop
In [41]: %timeit (b1(df4))
1000 loops, best of 3: 294 µs per loop
len(df)=5:
In [46]: %timeit (a(df1))
1000 loops, best of 3: 206 µs per loop
In [47]: %timeit (a1(df2))
1000 loops, best of 3: 204 µs per loop
In [48]: %timeit (b(df3))
The slowest run took 6.30 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 164 µs per loop
In [49]: %timeit (b1(df4))
The slowest run took 6.44 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 164 µs per loop
Code for testing:
d = {'col': ["baked","beans","baked","baked","beans"]}
df = pd.DataFrame(data=d)
print (df)
df = pd.concat([df]*100000).reset_index(drop=True)
#test for 5k
#df = pd.concat([df]*1000).reset_index(drop=True)
df1,df2,df3, df4 = df.copy(),df.copy(),df.copy(),df.copy()
def a(df):
    df['col'] = pd.factorize(df.col)[0]
    return df
def a1(df):
    idx,_ = pd.factorize(df.col)
    df['col'] = idx
    return df
def b(df):
    df['col'] = np.unique(df['col'],return_inverse=True)[1]
    return df
def b1(df):
    _,idx = np.unique(df['col'],return_inverse=True)
    df['col'] = idx
    return df
print (a(df1))
print (a1(df2))
print (b(df3))
print (b1(df4))
You can use np.unique's optional argument return_inverse to give each string an ID based on its position among the unique values, and then set those IDs in the input dataframe, like so:
_,idx = np.unique(df['col'],return_inverse=True)
df['col'] = idx
Please note that the IDs are indices into the alphabetically sorted array of unique strings. If you need that unique array, you can replace _ with a variable name, like so:
uniq_lab,idx = np.unique(df['col'],return_inverse=True)
Sample run:
>>> d = {'col': ["baked","beans","baked","baked","beans"]}
>>> df = pd.DataFrame(data=d)
>>> df
col
0 baked
1 beans
2 baked
3 baked
4 beans
>>> _,idx = np.unique(df['col'],return_inverse=True)
>>> df['col'] = idx
>>> df
col
0 0
1 1
2 0
3 0
4 1
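One difference worth checking when choosing between the two approaches: np.unique numbers the labels in sorted order, while pd.factorize by default numbers them in order of first appearance. In the sample data "baked" happens to come first both ways, so the results agree. The sketch below uses a reordered input (an assumption for illustration, not data from the question) to show where they diverge, and the sort=True option of pd.factorize that brings them back in line.

```python
import numpy as np
import pandas as pd

# Reordered sample so that first-seen order differs from sorted order
s = pd.Series(["beans", "baked", "beans"])

# np.unique: codes index into the sorted array of unique labels
uniq_sorted, idx_sorted = np.unique(s.to_numpy(), return_inverse=True)

# pd.factorize: codes follow the order in which labels first appear
idx_seen, uniq_seen = pd.factorize(s)

# sort=True makes factorize number the labels in sorted order as well
idx_fact_sorted, uniq_fact_sorted = pd.factorize(s, sort=True)

print(list(uniq_sorted), list(idx_sorted))
print(list(uniq_seen), list(idx_seen))
print(list(uniq_fact_sorted), list(idx_fact_sorted))
```

If the codes must match an alphabetically sorted label array, as in the question, either np.unique or pd.factorize with sort=True would fit.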