
Python - Speed up converting a categorical variable to its numerical index

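I need to convert a column of categorical variables in a Pandas data frame into numerical values that correspond to the index into an array of the unique categorical variables in the column (long story!). Here is a code snippet that accomplishes that: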

import pandas as pd
import numpy as np

d = {'col': ["baked","beans","baked","baked","beans"]}
df = pd.DataFrame(data=d)
uniq_lab = np.unique(df['col'])

for lab in uniq_lab:
    df['col'].replace(lab,np.where(uniq_lab == lab)[0][0].astype(float),inplace=True)

It converts the data frame:

    col
 0  baked
 1  beans
 2  baked
 3  baked
 4  beans

into the data frame:

    col
 0  0.0
 1  1.0
 2  0.0
 3  0.0
 4  1.0

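as expected. My problem is that when I run similar code on big data files, my silly little loop (the only way I could think of to do this) is slow as molasses. I'm curious whether anyone knows a more efficient way to do this. Thanks in advance for any ideas.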

Use factorize:

df['col'] = pd.factorize(df.col)[0]
print (df)
   col
0    0
1    1
2    0
3    0
4    1

See the pandas documentation for factorize.
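For reference, pd.factorize returns a pair: the integer codes and the unique values those codes index into. Below is a minimal sketch reusing the question's sample data. The names codes and uniques are only illustrative, and the final cast to float is needed only if the 0.0 / 1.0 output shown in the question matters:

```python
import pandas as pd

df = pd.DataFrame({'col': ["baked", "beans", "baked", "baked", "beans"]})

# factorize returns (codes, uniques); each code is a position in uniques
codes, uniques = pd.factorize(df['col'])
print(codes)          # one integer code per row
print(list(uniques))  # unique labels in order of first appearance

# taking uniques at the code positions reconstructs the original labels
assert list(uniques.take(codes)) == list(df['col'])

# cast to float only if float output is required, as in the question
df['col'] = codes.astype(float)
print(df)
```

Keeping uniques around makes it possible to map the numbers back to labels later.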

EDIT:

As Jeff mentioned in the comments, the best option is to convert the column to categorical, mainly because it uses less memory:

df['col'] = df['col'].astype("category")
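Note that a categorical column still displays the original labels; the integer index is available through the .cat.codes accessor. The sketch below uses the sample data repeated 1000 times (an arbitrary size chosen for illustration) and compares memory use. Exact byte counts depend on the pandas version and platform, so none are shown:

```python
import pandas as pd

df = pd.DataFrame({'col': ["baked", "beans", "baked", "baked", "beans"] * 1000})

as_strings = df['col']
as_category = df['col'].astype("category")

# categories hold the unique labels (sorted here); codes index into them
print(list(as_category.cat.categories))
codes = as_category.cat.codes
print(codes.head().tolist())

# deep=True includes the memory held by the strings themselves
print(as_strings.memory_usage(deep=True))
print(as_category.memory_usage(deep=True))
```

If a plain numeric column is needed downstream, codes can be assigned back to the data frame.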

Timings

Interestingly, for a large df pandas is faster than numpy. I can't believe it.

len(df)=500k

In [29]: %timeit (a(df1))
100 loops, best of 3: 9.27 ms per loop

In [30]: %timeit (a1(df2))
100 loops, best of 3: 9.32 ms per loop

In [31]: %timeit (b(df3))
10 loops, best of 3: 24.6 ms per loop

In [32]: %timeit (b1(df4))
10 loops, best of 3: 24.6 ms per loop  

len(df)=5k

In [38]: %timeit (a(df1))
1000 loops, best of 3: 274 µs per loop

In [39]: %timeit (a1(df2))
The slowest run took 6.71 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 273 µs per loop

In [40]: %timeit (b(df3))
The slowest run took 5.15 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 295 µs per loop

In [41]: %timeit (b1(df4))
1000 loops, best of 3: 294 µs per loop

len(df)=5

In [46]: %timeit (a(df1))
1000 loops, best of 3: 206 µs per loop

In [47]: %timeit (a1(df2))
1000 loops, best of 3: 204 µs per loop

In [48]: %timeit (b(df3))
The slowest run took 6.30 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 164 µs per loop

In [49]: %timeit (b1(df4))
The slowest run took 6.44 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 164 µs per loop

Test code

d = {'col': ["baked","beans","baked","baked","beans"]}
df = pd.DataFrame(data=d)
print (df)
df = pd.concat([df]*100000).reset_index(drop=True)
#test for 5k
#df = pd.concat([df]*1000).reset_index(drop=True)


df1,df2,df3, df4 = df.copy(),df.copy(),df.copy(),df.copy()

def a(df):
    df['col'] = pd.factorize(df.col)[0]
    return df

def a1(df):
    idx,_ = pd.factorize(df.col)
    df['col'] = idx
    return df

def b(df):
    df['col'] = np.unique(df['col'],return_inverse=True)[1]
    return df

def b1(df):
    _,idx = np.unique(df['col'],return_inverse=True)
    df['col'] = idx    
    return df

print (a(df1))    
print (a1(df2))   
print (b(df3))   
print (b1(df4))  
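The %timeit lines above are IPython magics. Outside IPython, a similar comparison can be sketched with the standard-library timeit module. One caveat about the test functions above: they assign the codes back to df['col'], so after the first call later repetitions operate on integers, not strings. The hypothetical helpers below (factorize_codes and unique_codes, names not from the original post) return the codes without modifying the frame, so every repetition sees the original strings. No timings are shown because they vary by machine and library version:

```python
import timeit

import numpy as np
import pandas as pd

base = pd.DataFrame({'col': ["baked", "beans", "baked", "baked", "beans"]})
df = pd.concat([base] * 1000).reset_index(drop=True)  # 5k rows

def factorize_codes(frame):
    # pandas approach: codes in order of first appearance
    return pd.factorize(frame['col'])[0]

def unique_codes(frame):
    # numpy approach: codes into the sorted unique array
    return np.unique(frame['col'], return_inverse=True)[1]

# the two agree on this data because the labels first appear in sorted order
assert (factorize_codes(df) == unique_codes(df)).all()

for fn in (factorize_codes, unique_codes):
    # best of 3 repeats, each repeat running 100 calls
    best = min(timeit.repeat(lambda: fn(df), number=100, repeat=3))
    print(f"{fn.__name__}: {best / 100 * 1e6:.0f} µs per call")
```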

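You can use np.unique with its optional argument return_inverse to assign each string an ID based on which unique value it is, and set those IDs in the input dataframe, like so -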

_,idx = np.unique(df['col'],return_inverse=True)
df['col'] = idx

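Please note that the IDs correspond to the alphabetically sorted array of unique strings. If you need that unique array as well, capture it in place of _, like so -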

uniq_lab,idx = np.unique(df['col'],return_inverse=True)
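The ordering only matters when the labels do not first appear in alphabetical order. Here is a small sketch with a hypothetical reordered sample (not from the original post) comparing np.unique with pd.factorize. The sort=True argument of factorize numbers the labels in sorted order, which makes the two approaches agree:

```python
import numpy as np
import pandas as pd

# hypothetical sample: "beans" appears before "baked"
s = pd.Series(["beans", "baked", "beans", "baked", "baked"])

# np.unique: codes refer to the alphabetically sorted unique array
uniq_sorted, idx_np = np.unique(s, return_inverse=True)

# pd.factorize: codes follow order of first appearance by default
idx_pd, uniq_pd = pd.factorize(s)

# sort=True makes factorize number the labels in sorted order as well
idx_pd_sorted, uniq_pd_sorted = pd.factorize(s, sort=True)

print(idx_np.tolist())         # [1, 0, 1, 0, 0]
print(idx_pd.tolist())         # [0, 1, 0, 1, 1]
print(idx_pd_sorted.tolist())  # [1, 0, 1, 0, 0]
```

On the question's own sample the two orderings coincide, which is why both answers print the same result.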

Sample run -

>>> d = {'col': ["baked","beans","baked","baked","beans"]}
>>> df = pd.DataFrame(data=d)
>>> df
     col
0  baked
1  beans
2  baked
3  baked
4  beans
>>> _,idx = np.unique(df['col'],return_inverse=True)
>>> df['col'] = idx
>>> df
   col
0    0
1    1
2    0
3    0
4    1
