Converting string/numeric data to categorical format in Pandas

Question

I have a very large csv file that I have converted to a Pandas dataframe, which has string and integer/float values. I would like to change this data to categorical format in order to try and save some memory. I am basing this idea on the documentation here: https://pandas.pydata.org/pandas-docs/version/0.20/categorical.html

My dataframe looks like the following:

    clean_data_measurements.head(20)

        station         date    prcp    tobs
    0   USC00519397 1/1/2010    0.08    65
    1   USC00519397 1/2/2010    0.00    63
    2   USC00519397 1/3/2010    0.00    74
    3   USC00519397 1/4/2010    0.00    76
    5   USC00519397 1/7/2010    0.06    70
    6   USC00519397 1/8/2010    0.00    64
    7   USC00519397 1/9/2010    0.00    68
    8   USC00519397 1/10/2010   0.00    73
    9   USC00519397 1/11/2010   0.01    64
    10  USC00519397 1/12/2010   0.00    61
    11  USC00519397 1/14/2010   0.00    66
    12  USC00519397 1/15/2010   0.00    65
    13  USC00519397 1/16/2010   0.00    68
    14  USC00519397 1/17/2010   0.00    64
    15  USC00519397 1/18/2010   0.00    72
    16  USC00519397 1/19/2010   0.00    66
    17  USC00519397 1/20/2010   0.00    66
    18  USC00519397 1/21/2010   0.00    69
    19  USC00519397 1/22/2010   0.00    67
    20  USC00519397 1/23/2010   0.00    67

It is precipitation data that goes on for another 2700 rows. Since every row belongs to the same category (station number), it should be convertible to categorical format, which will save processing time. I am just unsure of how to write the code. Can anyone help? Thanks.

Answer 1

I think we can convert the object columns to category-style integer codes by using factorize:

objectdf=df.select_dtypes(include='object')

df.loc[:,objectdf.columns]=objectdf.apply(lambda x : pd.factorize(x)[0])
df
Out[452]: 
    station  date  prcp  tobs
0         0     0  0.08    65
1         0     1  0.00    63
2         0     2  0.00    74
3         0     3  0.00    76
5         0     4  0.06    70
6         0     5  0.00    64
7         0     6  0.00    68
8         0     7  0.00    73
9         0     8  0.01    64
10        0     9  0.00    61
11        0    10  0.00    66
12        0    11  0.00    65
13        0    12  0.00    68
14        0    13  0.00    64
15        0    14  0.00    72
16        0    15  0.00    66
17        0    16  0.00    66
18        0    17  0.00    69
19        0    18  0.00    67
20        0    19  0.00    67
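
Note that `pd.factorize` returns plain integer codes, and the original labels are lost unless its second return value is kept. If the aim is the `category` dtype from the documentation linked in the question, a minimal sketch might look like the following. The small frame is a made-up stand-in for `clean_data_measurements` (values copied from its first rows), and how much memory is saved depends on how often values repeat in each column.

```python
import pandas as pd

# Hypothetical stand-in for the question's dataframe.
df = pd.DataFrame({
    "station": ["USC00519397"] * 4,
    "date": ["1/1/2010", "1/2/2010", "1/3/2010", "1/4/2010"],
    "prcp": [0.08, 0.00, 0.00, 0.00],
    "tobs": [65, 63, 74, 76],
})

# Memory of the column (including its string contents) before conversion.
before = df["station"].memory_usage(deep=True)

# The category dtype stores each distinct label once plus a small integer code per row.
df["station"] = df["station"].astype("category")

after = df["station"].memory_usage(deep=True)

print(df["station"].dtype)                  # category
print(list(df["station"].cat.categories))   # the distinct labels, kept intact
print(df["station"].cat.codes.tolist())     # the per-row integer codes
print(before, after)                        # compare the two byte counts
```

Unlike the factorized column, the converted column still displays and filters by the station name, and `.cat.codes` gives the integer codes when they are needed. A column such as `date`, where nearly every value is unique, is unlikely to benefit much from this.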

You can try this as well.

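import numpy as np
import pandas as pd

for col, dtype in zip(df.columns, df.dtypes):
    if dtype == 'object':
        # replace each distinct string with an integer code
        df[col] = pd.factorize(df[col])[0]
    elif dtype == 'int64':
        # int8 only holds -128..127; larger values would wrap around
        df[col] = df[col].astype(np.int8)
    else:
        # float32 uses half the memory of float64, at reduced precision
        df[col] = df[col].astype(np.float32)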
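
One caveat with casting straight to `int8`: it only covers -128 to 127, so larger values get wrapped silently. A hedged alternative is to let `pd.to_numeric` with its `downcast` argument pick the smallest type that fits. The frame below is made-up sample data, and `big` is a hypothetical column added only to show a case that does not fit in `int8`.

```python
import pandas as pd

# Made-up sample data for illustration, not the question's file.
df = pd.DataFrame({
    "tobs": [65, 63, 74, 76],
    "big": [100, 200, 300, 40000],   # hypothetical column, too large for int8
    "prcp": [0.08, 0.00, 0.00, 0.06],
})

for col in df.columns:
    if pd.api.types.is_integer_dtype(df[col]):
        # Chooses the smallest signed integer type that can hold every value.
        df[col] = pd.to_numeric(df[col], downcast="integer")
    elif pd.api.types.is_float_dtype(df[col]):
        # Tries float32 and keeps float64 if the values do not survive the cast.
        df[col] = pd.to_numeric(df[col], downcast="float")

print(df.dtypes)
```

Here `tobs` fits in `int8`, while `big` is left at a wider integer type, so no values are changed by the conversion.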
