Converting string/numerical data to categorical format in pandas

Question

I have a very large csv file that I have converted to a Pandas dataframe, which has string and integer/float values. I would like to change this data to categorical format in order to try and save some memory. I am basing this idea off of the documentation here: https://pandas.pydata.org/pandas-docs/version/0.20/categorical.html

My dataframe looks like the following:

    clean_data_measurements.head(20)

        station         date    prcp    tobs
    0   USC00519397 1/1/2010    0.08    65
    1   USC00519397 1/2/2010    0.00    63
    2   USC00519397 1/3/2010    0.00    74
    3   USC00519397 1/4/2010    0.00    76
    5   USC00519397 1/7/2010    0.06    70
    6   USC00519397 1/8/2010    0.00    64
    7   USC00519397 1/9/2010    0.00    68
    8   USC00519397 1/10/2010   0.00    73
    9   USC00519397 1/11/2010   0.01    64
    10  USC00519397 1/12/2010   0.00    61
    11  USC00519397 1/14/2010   0.00    66
    12  USC00519397 1/15/2010   0.00    65
    13  USC00519397 1/16/2010   0.00    68
    14  USC00519397 1/17/2010   0.00    64
    15  USC00519397 1/18/2010   0.00    72
    16  USC00519397 1/19/2010   0.00    66
    17  USC00519397 1/20/2010   0.00    66
    18  USC00519397 1/21/2010   0.00    69
    19  USC00519397 1/22/2010   0.00    67
    20  USC00519397 1/23/2010   0.00    67

It is precipitation data which goes on another 2700 rows. Since it is all of the same category (station number), it should be convertible to categorical format which will save processing time. I am just unsure of how to write the code. Can anyone help? Thanks.

Answer 1

I think we can convert object to category data by using factorize

objectdf=df.select_dtypes(include='object')

df.loc[:,objectdf.columns]=objectdf.apply(lambda x : pd.factorize(x)[0])
df
Out[452]: 
    station  date  prcp  tobs
0         0     0  0.08    65
1         0     1  0.00    63
2         0     2  0.00    74
3         0     3  0.00    76
5         0     4  0.06    70
6         0     5  0.00    64
7         0     6  0.00    68
8         0     7  0.00    73
9         0     8  0.01    64
10        0     9  0.00    61
11        0    10  0.00    66
12        0    11  0.00    65
13        0    12  0.00    68
14        0    13  0.00    64
15        0    14  0.00    72
16        0    15  0.00    66
17        0    16  0.00    66
18        0    17  0.00    69
19        0    18  0.00    67
20        0    19  0.00    67

You can try this as well.

for y,x in zip(df.columns,df.dtypes):
    if x == 'object':
        df[y]=pd.factorize(df[y])[0]
    elif x=='int64':
        df[y]=df[y].astype(np.int8)
    else:
        df[y]=df[y].astype(np.float32)

Converting string/numerical data to categorical format in pandas

Question

1 answers

solution1
0 2018-07-17 02:48:33

Converting string/numerical data to categorical format in pandas

Question

1 answers

solution1 0 2018-07-17 02:48:33

solution1
0 2018-07-17 02:48:33