Issue with converting a pandas column from int64 to datetime64
Converting column from int64 to datetime in hdf5 file using Python's Pandas package
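I'm new to pandas and to programming, so any help would be greatly appreciated.
I'm having trouble converting a column of a pandas DataFrame, loaded from an hdf5 file, into datetime objects. The data is too large to work with as a text file, so I converted it to an hdf5 file with the following code: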
import urllib.request
import zipfile
import pandas as pd

# get text file from zip file and unzip
# (`file` is the download URL and `dir` is the local path for the zip; both are defined elsewhere)
urllib.request.urlretrieve(file, dir)
z = zipfile.ZipFile(dir)
data = z.open(z.namelist()[0])
# column names from text file
colnames = ['Patent#','App#','Small','Filing Date','Issue Date', 'Event Date', 'Event Code']
# load the data in chunks and concat into single DataFrame
mfees = pd.read_table(data, index_col=0, sep=r'\s+', header=None, names=colnames, chunksize=1000, iterator=True)
df = pd.concat(mfees, ignore_index=False)
# close files
z.close()
data.close()
# convert to hdf5 file (call to_hdf on the DataFrame, not on the closed file handle)
df.to_hdf('mfees.h5', key='raw_data', format='table')
# load the DataFrame back from the hdf5 file
data = pd.read_hdf('mfees.h5', 'raw_data')
After that, my data is in the following format:
data['Filing Date']
Output:
Patent#
4287053 19801222
4287053 19801222
4289713 19810105
4289713 19810105
4289713 19810105
4289713 19810105
4289713 19810105
4289713 19810105
Name: Filing Date, Length: 11887679, dtype: int64
However, when I use the to_datetime function, I get the following:
data['Filing Date'] = pd.to_datetime(data['Filing Date'])
data['Filing Date']
Output:
Patent#
4287053 1970-01-01 00:00:00.019801222
4287053 1970-01-01 00:00:00.019801222
4289713 1970-01-01 00:00:00.019810105
4289713 1970-01-01 00:00:00.019810105
4289713 1970-01-01 00:00:00.019810105
4289713 1970-01-01 00:00:00.019810105
4289713 1970-01-01 00:00:00.019810105
4289713 1970-01-01 00:00:00.019810105
4289713 1970-01-01 00:00:00.019810105
4291808 1970-01-01 00:00:00.019801212
4291808 1970-01-01 00:00:00.019801212
4292069 1970-01-01 00:00:00.019810123
4292069 1970-01-01 00:00:00.019810123
4292069 1970-01-01 00:00:00.019810123
4292069 1970-01-01 00:00:00.019810123
Name: Filing Date, Length: 11887679, dtype: datetime64[ns]
I'm not sure why I get the above output for the datetime objects. Is there a way to fix this and convert the dates to datetime objects correctly? Thanks!
It is easiest to convert the dates while reading the file. (Note that I copied and pasted your data, so you only need to add the parse_dates=[1] option.)
In [31]: df = read_csv(StringIO(data),sep='\s+',header=None,parse_dates=[1],names=['num','date']).set_index('num')
In [32]: df
Out[32]:
date
num
4287053 1980-12-22 00:00:00
4287053 1980-12-22 00:00:00
4289713 1981-01-05 00:00:00
4289713 1981-01-05 00:00:00
4289713 1981-01-05 00:00:00
4289713 1981-01-05 00:00:00
4289713 1981-01-05 00:00:00
4289713 1981-01-05 00:00:00
In [33]: df.dtypes
Out[33]:
date datetime64[ns]
dtype: object
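For readers who want to reproduce the read-time conversion outside an IPython session, here is a minimal self-contained sketch. It assumes a recent pandas and reuses only the first rows of the question's output; the variable name `raw` is made up for illustration, and only the index and 'Filing Date' columns are included because those are the only values shown in the question.

```python
from io import StringIO

import pandas as pd

# A few rows copied from the question: patent number, then filing date as YYYYMMDD.
raw = """4287053 19801222
4287053 19801222
4289713 19810105
4289713 19810105
"""

# parse_dates asks read_csv to convert the named column while reading,
# so the values never end up as plain int64.
df = pd.read_csv(
    StringIO(raw),
    sep=r"\s+",
    header=None,
    names=["Patent#", "Filing Date"],
    index_col=0,
    parse_dates=["Filing Date"],
)

print(df.dtypes)
print(df.head())
```

In the question's own code the same idea would presumably mean passing the date column names ('Filing Date', 'Issue Date', 'Event Date') to parse_dates in the pd.read_table call, though that has not been checked against the full file.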
HDF will then handle the datetime column:
In [46]: df.to_hdf('test.h5','df',mode='w',format='table')
In [47]: pd.read_hdf('test.h5','df')
Out[47]:
date
num
4287053 1980-12-22 00:00:00
4287053 1980-12-22 00:00:00
4289713 1981-01-05 00:00:00
4289713 1981-01-05 00:00:00
4289713 1981-01-05 00:00:00
4289713 1981-01-05 00:00:00
4289713 1981-01-05 00:00:00
4289713 1981-01-05 00:00:00
In [48]: pd.read_hdf('test.h5','df').dtypes
Out[48]:
date datetime64[ns]
dtype: object
Here is a converter for int-like dates, which should be quite fast:
In [18]: s = Series([19801222,19801222] + [19810105]*5)
In [19]: s
Out[19]:
0 19801222
1 19801222
2 19810105
3 19810105
4 19810105
5 19810105
6 19810105
dtype: int64
In [20]: s = s.values.astype(object)
In [21]: Series(pd.lib.try_parse_year_month_day(s/10000,s/100 % 100, s % 100))
Out[21]:
0 1980-12-22 00:00:00
1 1980-12-22 00:00:00
2 1981-01-05 00:00:00
3 1981-01-05 00:00:00
4 1981-01-05 00:00:00
5 1981-01-05 00:00:00
6 1981-01-05 00:00:00
dtype: datetime64[ns]
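The helper used in the transcript above may not be available in recent pandas releases, so the following is a hedged sketch of the same conversion using only pd.to_datetime. It also shows why the question's output landed in 1970: without a format, integers are read as nanoseconds since the Unix epoch. The variable names (`naive`, `parsed`, `parts`, `from_parts`) are made up for illustration.

```python
import pandas as pd

# Same sample values as in the transcript above.
s = pd.Series([19801222, 19801222] + [19810105] * 5)

# Without a format, each integer is treated as a nanosecond offset from
# 1970-01-01, which reproduces the output seen in the question.
naive = pd.to_datetime(s)

# Option 1: convert to strings and give the layout explicitly.
parsed = pd.to_datetime(s.astype(str), format="%Y%m%d")

# Option 2: split the integer arithmetically, mirroring the idea in the
# transcript, and assemble the dates from year/month/day columns.
parts = pd.DataFrame({
    "year": s // 10000,
    "month": s // 100 % 100,
    "day": s % 100,
})
from_parts = pd.to_datetime(parts)

print(naive.iloc[0])
print(parsed.iloc[0])
print(from_parts.iloc[0])
```

Applied to the question's DataFrame, option 1 would read data['Filing Date'] = pd.to_datetime(data['Filing Date'].astype(str), format='%Y%m%d'). Which option is faster on roughly 12 million rows has not been measured here.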