简体   繁体   中英

Converting string/numerical data to categorical format in pandas

I have a very large csv file that I have converted to a Pandas dataframe, which has string and integer/float values. I would like to change this data to categorical format in order to try and save some memory. I am basing this idea off of the documentation here: https://pandas.pydata.org/pandas-docs/version/0.20/categorical.html

My dataframe looks like the following:

    clean_data_measurements.head(20)

        station         date    prcp    tobs
    0   USC00519397 1/1/2010    0.08    65
    1   USC00519397 1/2/2010    0.00    63
    2   USC00519397 1/3/2010    0.00    74
    3   USC00519397 1/4/2010    0.00    76
    5   USC00519397 1/7/2010    0.06    70
    6   USC00519397 1/8/2010    0.00    64
    7   USC00519397 1/9/2010    0.00    68
    8   USC00519397 1/10/2010   0.00    73
    9   USC00519397 1/11/2010   0.01    64
    10  USC00519397 1/12/2010   0.00    61
    11  USC00519397 1/14/2010   0.00    66
    12  USC00519397 1/15/2010   0.00    65
    13  USC00519397 1/16/2010   0.00    68
    14  USC00519397 1/17/2010   0.00    64
    15  USC00519397 1/18/2010   0.00    72
    16  USC00519397 1/19/2010   0.00    66
    17  USC00519397 1/20/2010   0.00    66
    18  USC00519397 1/21/2010   0.00    69
    19  USC00519397 1/22/2010   0.00    67
    20  USC00519397 1/23/2010   0.00    67

It is precipitation data which goes on another 2700 rows. Since it is all of the same category (station number), it should be convertible to categorical format which will save processing time. I am just unsure of how to write the code. Can anyone help? Thanks.

I think we can convert object to category data by using factorize

objectdf=df.select_dtypes(include='object')

df.loc[:,objectdf.columns]=objectdf.apply(lambda x : pd.factorize(x)[0])
df
Out[452]: 
    station  date  prcp  tobs
0         0     0  0.08    65
1         0     1  0.00    63
2         0     2  0.00    74
3         0     3  0.00    76
5         0     4  0.06    70
6         0     5  0.00    64
7         0     6  0.00    68
8         0     7  0.00    73
9         0     8  0.01    64
10        0     9  0.00    61
11        0    10  0.00    66
12        0    11  0.00    65
13        0    12  0.00    68
14        0    13  0.00    64
15        0    14  0.00    72
16        0    15  0.00    66
17        0    16  0.00    66
18        0    17  0.00    69
19        0    18  0.00    67
20        0    19  0.00    67

You can try this as well.

for y,x in zip(df.columns,df.dtypes):
    if x == 'object':
        df[y]=pd.factorize(df[y])[0]
    elif x=='int64':
        df[y]=df[y].astype(np.int8)
    else:
        df[y]=df[y].astype(np.float32)

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM