Converting column from int64 to datetime in hdf5 file using Python's Pandas package

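I'm new to pandas and to programming, so any help would be greatly appreciated.

I'm having trouble converting a column of data in a Pandas DataFrame, loaded from an hdf5 file, into datetime objects. The data is too large to work with as a text file, so I converted it to an hdf5 file using the following code: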

import urllib.request
import zipfile
import pandas as pd

# get text file from zip file and unzip
# (`file` and `dir` are assumed to be defined earlier: the download URL and the local zip path)
file = urllib.request.urlretrieve(file, dir)
z = zipfile.ZipFile(dir)
data = z.open(z.namelist()[0])

# column names from text file
colnames = ['Patent#','App#','Small','Filing Date','Issue Date', 'Event Date', 'Event Code']

# load the data in chunks and concat into single DataFrame
mfees = pd.read_table(data, index_col=0, sep=r'\s+', header=None, names=colnames, chunksize=1000, iterator=True)
df = pd.concat([chunk for chunk in mfees], ignore_index=False)

# close files
data.close()
z.close()

# convert to hdf5 file (call to_hdf on the DataFrame, not on the file handle), then load it back
df.to_hdf('mfees.h5', key='raw_data', format='table')
data = pd.read_hdf('mfees.h5', 'raw_data')

After that, my data is in the following format:

data['Filing Date']

Output:

Patent#
4287053    19801222
4287053    19801222
4289713    19810105
4289713    19810105
4289713    19810105
4289713    19810105
4289713    19810105
4289713    19810105
Name: Filing Date, Length: 11887679, dtype: int64

However, when I use the to_datetime function, I get the following:

data['Filing Date'] = pd.to_datetime(data['Filing Date'])
data['Filing Date']

Output:

Patent#
4287053   1970-01-01 00:00:00.019801222
4287053   1970-01-01 00:00:00.019801222
4289713   1970-01-01 00:00:00.019810105
4289713   1970-01-01 00:00:00.019810105
4289713   1970-01-01 00:00:00.019810105
4289713   1970-01-01 00:00:00.019810105
4289713   1970-01-01 00:00:00.019810105
4289713   1970-01-01 00:00:00.019810105
4289713   1970-01-01 00:00:00.019810105
4291808   1970-01-01 00:00:00.019801212
4291808   1970-01-01 00:00:00.019801212
4292069   1970-01-01 00:00:00.019810123
4292069   1970-01-01 00:00:00.019810123
4292069   1970-01-01 00:00:00.019810123
4292069   1970-01-01 00:00:00.019810123
Name: Filing Date, Length: 11887679, dtype: datetime64[ns]

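I'm not sure why I'm getting the above output for the datetime objects. Is there any way I can correct this and properly convert the dates into datetime objects? Thanks!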
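
As a hedged illustration of what the output above shows: when pd.to_datetime is given plain integers without a unit or format, it reads them as offsets from 1970-01-01 (nanoseconds by default), so 19801222 ends up about 0.0198 seconds after the epoch. The sketch below uses two values from the question; the names `raw`, `as_offsets`, and `as_dates` are made up for illustration, and it assumes that casting to strings and passing an explicit format is acceptable for this data.

```python
import pandas as pd

# Two sample values taken from the question's 'Filing Date' column.
raw = pd.Series([19801222, 19810105], name="Filing Date")

# No format given: each integer is treated as nanoseconds since 1970-01-01.
as_offsets = pd.to_datetime(raw)
print(as_offsets.iloc[0])  # 1970-01-01 00:00:00.019801222

# Explicit format: the digits are parsed as year, month, day.
as_dates = pd.to_datetime(raw.astype(str), format="%Y%m%d")
print(as_dates.iloc[0])  # 1980-12-22 00:00:00
```

The same conversion could be applied to the column after it is loaded back from the hdf5 file.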

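The easiest way is to convert the column while reading the file, with parse_dates=[1] (note that I copy-pasted your data here, so you only need to add the parse_dates=[1] option):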

In [31]: df = read_csv(StringIO(data),sep='\s+',header=None,parse_dates=[1],names=['num','date']).set_index('num')

In [32]: df
Out[32]: 
                       date
num                        
4287053 1980-12-22 00:00:00
4287053 1980-12-22 00:00:00
4289713 1981-01-05 00:00:00
4289713 1981-01-05 00:00:00
4289713 1981-01-05 00:00:00
4289713 1981-01-05 00:00:00
4289713 1981-01-05 00:00:00
4289713 1981-01-05 00:00:00

In [33]: df.dtypes
Out[33]: 
date    datetime64[ns]
dtype: object
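
For reference, a self-contained version of the read step might look like the sketch below. The two-column sample text and the names `sample` and `parsed` are made up for illustration (the values come from the question), and the date column is referred to by name rather than by position. How date strings are inferred can differ between pandas versions, so treat this as a sketch rather than a guaranteed recipe.

```python
from io import StringIO

import pandas as pd

# Small whitespace-separated sample: patent number, filing date as YYYYMMDD.
sample = "4287053 19801222\n4287053 19801222\n4289713 19810105\n"

parsed = pd.read_csv(
    StringIO(sample),
    sep=r"\s+",
    header=None,
    names=["num", "date"],
    parse_dates=["date"],  # ask the reader to convert this column to datetimes
).set_index("num")

print(parsed.dtypes)
```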

HDF then handles the datetime column as is:

In [46]: df.to_hdf('test.h5','df',mode='w',format='table')

In [47]: pd.read_hdf('test.h5','df')
Out[47]: 
                       date
num                        
4287053 1980-12-22 00:00:00
4287053 1980-12-22 00:00:00
4289713 1981-01-05 00:00:00
4289713 1981-01-05 00:00:00
4289713 1981-01-05 00:00:00
4289713 1981-01-05 00:00:00
4289713 1981-01-05 00:00:00
4289713 1981-01-05 00:00:00

In [48]: pd.read_hdf('test.h5','df').dtypes
Out[48]: 
date    datetime64[ns]
dtype: object

Here is a converter for int-like dates, which should be fast:

In [18]: s = Series([19801222,19801222] + [19810105]*5)

In [19]: s
Out[19]: 
0    19801222
1    19801222
2    19810105
3    19810105
4    19810105
5    19810105
6    19810105
dtype: int64

In [20]: s = s.values.astype(object)

In [21]: Series(pd.lib.try_parse_year_month_day(s//10000, s//100 % 100, s % 100))
Out[21]: 
0   1980-12-22 00:00:00
1   1980-12-22 00:00:00
2   1981-01-05 00:00:00
3   1981-01-05 00:00:00
4   1981-01-05 00:00:00
5   1981-01-05 00:00:00
6   1981-01-05 00:00:00
dtype: datetime64[ns]
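
The helper used above lives in a private pandas module and may not be available in the pandas version you have installed. A minimal sketch of the same idea (split each integer into year, month, and day with integer arithmetic, then assemble the dates) using only public API is shown below. The names `parts` and `dates` are made up for illustration; it relies on pd.to_datetime accepting a DataFrame with year, month, and day columns.

```python
import pandas as pd

s = pd.Series([19801222, 19801222] + [19810105] * 5)

# Integer division and modulo pull the date components out of YYYYMMDD.
parts = pd.DataFrame({
    "year": s // 10000,       # 19801222 -> 1980
    "month": s // 100 % 100,  # 19801222 -> 12
    "day": s % 100,           # 19801222 -> 22
})

# to_datetime assembles one timestamp per row from the component columns.
dates = pd.to_datetime(parts)
print(dates)
```

Note the use of `//` rather than `/`: under Python 3, `/` would produce floats instead of integer components.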
