Converting string/numerical data to categorical format in pandas
I have a very large CSV file that I have loaded into a pandas DataFrame, which has string and integer/float values. I would like to convert this data to categorical format to try to save some memory. I am basing this idea on the documentation here: https://pandas.pydata.org/pandas-docs/version/0.20/categorical.html
My dataframe looks like the following:
clean_data_measurements.head(20)
station date prcp tobs
0 USC00519397 1/1/2010 0.08 65
1 USC00519397 1/2/2010 0.00 63
2 USC00519397 1/3/2010 0.00 74
3 USC00519397 1/4/2010 0.00 76
5 USC00519397 1/7/2010 0.06 70
6 USC00519397 1/8/2010 0.00 64
7 USC00519397 1/9/2010 0.00 68
8 USC00519397 1/10/2010 0.00 73
9 USC00519397 1/11/2010 0.01 64
10 USC00519397 1/12/2010 0.00 61
11 USC00519397 1/14/2010 0.00 66
12 USC00519397 1/15/2010 0.00 65
13 USC00519397 1/16/2010 0.00 68
14 USC00519397 1/17/2010 0.00 64
15 USC00519397 1/18/2010 0.00 72
16 USC00519397 1/19/2010 0.00 66
17 USC00519397 1/20/2010 0.00 66
18 USC00519397 1/21/2010 0.00 69
19 USC00519397 1/22/2010 0.00 67
20 USC00519397 1/23/2010 0.00 67
It is precipitation data that goes on for another 2700 rows. Since the rows all share the same category (station number), it should be convertible to categorical format, which should save memory and processing time. I am just unsure of how to write the code. Can anyone help? Thanks.
I think we can convert the object columns to categorical integer codes by using factorize:
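import pandas as pd
# Select the string (object) columns and replace each value with an integer code
objectdf = df.select_dtypes(include='object')
# Assign with df[...] so the columns are replaced and take the new integer dtype
df[objectdf.columns] = objectdf.apply(lambda x: pd.factorize(x)[0])
df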
Out[452]:
station date prcp tobs
0 0 0 0.08 65
1 0 1 0.00 63
2 0 2 0.00 74
3 0 3 0.00 76
5 0 4 0.06 70
6 0 5 0.00 64
7 0 6 0.00 68
8 0 7 0.00 73
9 0 8 0.01 64
10 0 9 0.00 61
11 0 10 0.00 66
12 0 11 0.00 65
13 0 12 0.00 68
14 0 13 0.00 64
15 0 14 0.00 72
16 0 15 0.00 66
17 0 16 0.00 66
18 0 17 0.00 69
19 0 18 0.00 67
20 0 19 0.00 67
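Note that factorize returns plain integer codes, so the original labels are lost from the frame unless you keep its second return value. If the goal is the category dtype described in the linked documentation, a minimal sketch could look like the following. The small frame here is made up for illustration and only borrows the column names from the question:

```python
import pandas as pd

# Made-up sample with the same column names as the question's frame
df = pd.DataFrame({
    "station": ["USC00519397", "USC00519397", "USC00513117", "USC00519397"],
    "date": ["1/1/2010", "1/2/2010", "1/1/2010", "1/3/2010"],
    "prcp": [0.08, 0.00, 0.06, 0.00],
    "tobs": [65, 63, 70, 74],
})

# Convert the repetitive string column to the category dtype
df["station"] = df["station"].astype("category")

# The labels are stored once in .cat.categories;
# each row then holds only a small integer code
station_dtype = str(df["station"].dtype)
station_labels = list(df["station"].cat.categories)
station_codes = list(df["station"].cat.codes)
print(station_dtype, station_labels, station_codes)
```

A column such as date, where almost every value is unique, usually gains little from this, because the saving comes from storing each repeated label only once.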
You can try this as well:
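import numpy as np
import pandas as pd

for y, x in zip(df.columns, df.dtypes):
    if x == 'object':
        # Replace each distinct string with an integer code
        df[y] = pd.factorize(df[y])[0]
    elif x == 'int64':
        # int8 only holds -128..127; use a wider type if values can exceed that
        df[y] = df[y].astype(np.int8)
    else:
        # float32 uses half the memory of float64, at reduced precision
        df[y] = df[y].astype(np.float32)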
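To check whether a conversion really saves memory, one option is to compare memory_usage(deep=True) before and after. The sketch below uses a made-up frame (not the asker's data) and pd.to_numeric with the downcast argument, which picks the smallest numeric type that still holds the values rather than hard-coding int8:

```python
import pandas as pd

# Made-up sample: one station repeated many times, as in the question
n = 1000
df = pd.DataFrame({
    "station": ["USC00519397"] * n,
    "prcp": [0.0] * n,
    "tobs": [65] * n,
})

# Total bytes used, including the string contents
before = df.memory_usage(deep=True).sum()

small = df.copy()
small["station"] = small["station"].astype("category")
# downcast chooses the smallest type that can represent every value
small["tobs"] = pd.to_numeric(small["tobs"], downcast="integer")
small["prcp"] = pd.to_numeric(small["prcp"], downcast="float")

after = small.memory_usage(deep=True).sum()
print(before, after)
print(small.dtypes)
```

On real data the size of the saving depends on how often values repeat and on the range of the numbers, so it is worth measuring rather than assuming.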