
How to speed up LabelEncoder when recoding a categorical variable into integers

I have a large csv with two strings per row in this form:

g,k
a,h
c,i
j,e
d,i
i,h
b,b
d,d
i,a
d,h

I read in the first two columns and recode the strings to integers as follows:

import pandas as pd
df = pd.read_csv("test.csv", usecols=[0,1], prefix="ID_", header=None)
from sklearn.preprocessing import LabelEncoder

# Initialize the LabelEncoder.
le = LabelEncoder()
le.fit(df.values.flat)

# Convert to digits.
df = df.apply(le.transform)

This code is from https://stackoverflow.com/a/39419342/2179021.

The code works very well but is slow when df is large. I timed each step and the result was surprising to me.

  • pd.read_csv takes about 40 seconds.
  • le.fit(df.values.flat) takes about 30 seconds.
  • df = df.apply(le.transform) takes about 250 seconds.

Is there any way to speed up this last step? It feels like it should be the fastest step of them all!


More timings for the recoding step on a computer with 4GB of RAM

The answer below by maxymoo is fast but doesn't give the right answer. Taking the example csv from the top of the question, it translates it to:

   0  1
0  4  6
1  0  4
2  2  5
3  6  3
4  3  5
5  5  4
6  1  1
7  3  2
8  5  0
9  3  4

Notice that 'd' is mapped to 3 in the first column but 2 in the second.

I tried the solution from https://stackoverflow.com/a/39356398/2179021 and got the following.

df = pd.DataFrame({'ID_0':np.random.randint(0,1000,1000000), 'ID_1':np.random.randint(0,1000,1000000)}).astype(str)
df.info()
memory usage: 7.6MB
%timeit x = (df.stack().astype('category').cat.rename_categories(np.arange(len(df.stack().unique()))).unstack())
1 loops, best of 3: 1.7 s per loop

Then I increased the dataframe size by a factor of 10.

df = pd.DataFrame({'ID_0':np.random.randint(0,1000,10000000), 'ID_1':np.random.randint(0,1000,10000000)}).astype(str) 
df.info()
memory usage: 76.3+ MB
%timeit x = (df.stack().astype('category').cat.rename_categories(np.arange(len(df.stack().unique()))).unstack())
MemoryError                               Traceback (most recent call last)

This method appears to use so much RAM trying to translate this relatively small dataframe that it crashes.

I also timed LabelEncoder with the larger dataset of 10 million rows. It runs without crashing, but the fit line alone took 50 seconds. The df.apply(le.transform) step took about 80 seconds.

How can I:

  1. Get roughly the speed of maxymoo's answer and roughly the memory usage of LabelEncoder, but in a way that gives the right answer when the dataframe has two columns (a sketch of the behaviour I expect follows this list).
  2. Store the mapping so that I can reuse it for different data (in the way LabelEncoder allows me to do)?
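
To illustrate what I mean by the right answer: running the question's own snippet on the small csv at the top gives 'd' the same integer in both columns. A self-contained check (StringIO is used only so the sample data can be inlined):

import pandas as pd
from io import StringIO
from sklearn.preprocessing import LabelEncoder

csv = "g,k\na,h\nc,i\nj,e\nd,i\ni,h\nb,b\nd,d\ni,a\nd,h"
df = pd.read_csv(StringIO(csv), header=None)

le = LabelEncoder()
le.fit(df.values.flat)          # one encoder fitted on the values of both columns
print(df.apply(le.transform))   # 'd' -> 3 in column 0 and column 1 alike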

It looks like it will be much faster to use the pandas category datatype; internally this uses a hash table, whereas LabelEncoder uses a sorted search:

In [87]: df = pd.DataFrame({'ID_0':np.random.randint(0,1000,1000000), 
                            'ID_1':np.random.randint(0,1000,1000000)}).astype(str)

In [88]: le.fit(df.values.flat) 
         %time x = df.apply(le.transform)
CPU times: user 6.28 s, sys: 48.9 ms, total: 6.33 s
Wall time: 6.37 s

In [89]: %time x = df.apply(lambda x: x.astype('category').cat.codes)
CPU times: user 301 ms, sys: 28.6 ms, total: 330 ms
Wall time: 331 ms

EDIT: Here is a custom transformer class that you could use (you probably won't see this in an official scikit-learn release since the maintainers don't want to have pandas as a dependency):

import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class PandasLabelEncoder(BaseEstimator, TransformerMixin):
    def fit(self, y):
        # pd.unique is hash based and does not sort, so fitting stays fast
        self.classes_ = pd.unique(np.asarray(y).ravel())
        return self

    def transform(self, y):
        # Categorical lookup is hash based; values unseen during fit get code -1
        codes = pd.Categorical(y, categories=self.classes_).codes
        return pd.Series(codes, index=getattr(y, 'index', None))
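
A quick usage sketch (df as in the question, so this is illustrative rather than a benchmark):

enc = PandasLabelEncoder()
enc.fit(df.values.ravel())         # learn one shared set of categories from both columns
encoded = df.apply(enc.transform)  # each column gets codes from the same mapping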

I tried this with the DataFrame:

In [xxx]: import string
In [xxx]: letters = np.array([c for c in string.ascii_lowercase])
In [249]: df = pd.DataFrame({'ID_0': np.random.choice(letters, 10000000), 'ID_1':np.random.choice(letters, 10000000)})

It looks like this:

In [261]: df.head()
Out[261]: 
  ID_0 ID_1
0    v    z
1    i    i
2    d    n
3    z    r
4    x    x

In [262]: df.shape
Out[262]: (10000000, 2)

So, 10 million rows. Locally, my timings are:

In [257]: % timeit le.fit(df.values.flat)
1 loops, best of 3: 17.2 s per loop

In [258]: % timeit df2 = df.apply(le.transform)
1 loops, best of 3: 30.2 s per loop

Then I made a dict mapping letters to numbers and used pandas.Series.map:

In [248]: letters = np.array([l for l in string.ascii_lowercase])
In [263]: d = dict(zip(letters, range(26)))

In [273]: %timeit for c in df.columns: df[c] = df[c].map(d)
1 loops, best of 3: 1.12 s per loop

In [274]: df.head()
Out[274]: 
   ID_0  ID_1
0    21    25
1     8     8
2     3    13
3    25    17
4    23    23

So that might be an option. The dict just needs to have all of the values that occur in the data.
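
The same idea works when the values are not known in advance: build the dict from the uniques of the data itself, and keep it around (pickle it, say) to apply the identical mapping to other frames. A sketch, assuming df holds the two string columns:

import pandas as pd

uniques = pd.unique(df.values.ravel())              # every value seen in either column
mapping = {val: i for i, val in enumerate(uniques)}

for c in df.columns:                                # same dict -> same code in both columns
    df[c] = df[c].map(mapping)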

EDIT: The OP asked what timing I have for that second option, with categories. This is what I get:

In [40]: %timeit   x=df.stack().astype('category').cat.rename_categories(np.arange(len(df.stack().unique()))).unstack()
1 loops, best of 3: 13.5 s per loop

EDIT: per the 2nd comment:

In [45]: %timeit uniques = np.sort(pd.unique(df.values.ravel()))
1 loops, best of 3: 933 ms per loop

In [46]: %timeit  dfc = df.apply(lambda x: x.astype('category', categories=uniques))
1 loops, best of 3: 1.35 s per loop
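
On recent pandas versions the astype('category', categories=...) signature no longer exists; a roughly equivalent sketch builds the Categorical explicitly and takes .codes to get the shared integer encoding:

import numpy as np
import pandas as pd

uniques = np.sort(pd.unique(df.values.ravel()))
x = df.apply(lambda col: pd.Series(pd.Categorical(col, categories=uniques).codes,
                                   index=col.index))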

I would like to point out an alternate solution that should serve many readers well. Although I prefer to have a known set of IDs, it is not always necessary if all you need is a strictly one-way remapping.

Instead of

df[c] = df[c].apply(le.transform)

or

dict_table = {val: i for i, val in enumerate(uniques)}
df[c] = df[c].map(dict_table)

or (check out _encode() and _encode_python() in the sklearn source code, which I assume are faster on average than the other methods mentioned)

df[c] = np.array([dict_table[v] for v in df[c].values])

you can instead do

df[c] = df[c].apply(hash)

Pros: much faster, less memory needed, no training, and the hashes can be reduced to smaller representations by casting to a smaller dtype (at the cost of more collisions).

Cons: gives funky numbers, can have collisions (not guaranteed to be perfectly unique), and can't guarantee the function won't change with a new version of Python.

Note that secure hash functions will have fewer collisions, at the cost of speed.

Example of when to use: you have somewhat long strings that are mostly unique and the data set is huge. Most importantly, you don't care about rare hash collisions, even though they can be a source of noise in your model's predictions.
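
As an illustrative sketch of the dtype-reduction point (the modulus and column names here are arbitrary): folding the 64-bit hash into a smaller range saves memory but raises the collision probability, and because Python salts str hashes per process, PYTHONHASHSEED has to be fixed if the encoding must be reproducible across runs.

import pandas as pd

df = pd.DataFrame({'ID_0': ['alpha', 'beta', 'gamma'],
                   'ID_1': ['beta', 'delta', 'alpha']})

for c in df.columns:
    # hash() returns a 64-bit signed int; reducing it modulo 2**31 lets the column
    # fit in int32 -- smaller memory footprint, more potential collisions
    df[c] = (df[c].apply(hash) % (2 ** 31)).astype('int32')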

I've tried all the methods above, and my workload was taking about 90 minutes to learn the encoding from training data (1M rows and 600 features) and reapply it to several test sets, while also dealing with new values. The hash method brought it down to a few minutes and I don't need to save any model.
