Pandas DataFrame with date and time from JSON format
I'm importing data from a .json file into a pandas DataFrame, and the result is a bit broken:
>> print df
summary response_date
8.0 {u'$date': u'2009-02-19T10:54:00.000+0000'}
11.0 {u'$date': u'2009-02-24T11:23:45.000+0000'}
14.0 {u'$date': u'2009-03-03T17:55:07.000+0000'}
16.0 {u'$date': u'2009-03-10T12:23:04.000+0000'}
19.0 {u'$date': u'2009-03-17T17:19:55.000+0000'}
13.0 {u'$date': u'2009-03-25T15:10:52.000+0000'}
22.0 {u'$date': u'2009-04-02T16:57:31.000+0100'}
15.0 {u'$date': u'2009-04-08T22:29:09.000+0100'}
20.0 {u'$date': u'2009-04-16T18:14:20.000+0100'}
13.0 {u'$date': u'2009-04-29T10:47:06.000+0100'}
15.0 {u'$date': u'2009-05-06T13:45:45.000+0100'}
20.0 {u'$date': u'2009-05-26T10:41:52.000+0100'}
How can I get rid of the '$date' key and the other clutter to create a normal column with date and time? To convert from ISO 8601 format I normally use:
df.response_date = pd.to_datetime(df.response_date)
UPDATE 1
summary response_date closed_date open_date
24.0 2011-10-15T00:00:00.000+0100 NaN NaN
24.0 2011-11-24T09:00:00.000+0000 NaN NaN
19.0 2011-10-01T09:00:00.000+0100 NaN NaN
25.0 2011-10-29T09:00:00.000+0100 NaN NaN
19.0 2011-10-08T09:00:00.000+0100 NaN NaN
-1.0 2011-11-09T17:20:00.000+0000 {u'$date': u'2011-11-16T15:20:00.000+0000'} {u'$date': u'2011-11-09T15:20:00.000+0000'}
-1.0 2011-11-16T17:20:00.000+0000 {u'$date': u'2011-11-23T15:20:00.000+0000'} {u'$date': u'2011-11-16T15:20:00.000+0000'}
-1.0 2011-11-23T17:20:00.000+0000 {u'$date': u'2011-11-30T15:20:00.000+0000'} {u'$date': u'2011-11-23T15:20:00.000+0000'}
-1.0 2011-11-30T17:20:00.000+0000 {u'$date': u'2011-12-07T15:20:00.000+0000'} {u'$date': u'2011-11-30T15:20:00.000+0000'}
So, the line
>> df.response_date = pd.DataFrame(df.response_date.values.tolist())
worked perfectly, but the other columns contain NaN values, and imputing them with "-1" doesn't help.
>> print type(df.ix[0,'scheduleClosedAt'])
<type 'int'>
UPDATE 2
Why does this (masking) method not work?
>> df.reset_index(inplace=True)
>> indx_nan_closed = df.closed_date.isnull()
>> df[~indx_nan_closed].closed_date = pd.DataFrame(df[~indx_nan_closed].closed_date.values.tolist())
This line is equivalent to the one above, but with a boolean mask, because I want to apply the method only to the non-NaN values. The result, however, is that my data frame "df" remains unchanged. This is quite strange.
Any thoughts?
You can use the DataFrame constructor, converting the response_date column to a list via values, if the type is dict:
print(type(df.loc[0, 'response_date']))
<class 'dict'>
df.response_date = pd.DataFrame(df.response_date.values.tolist())
df.response_date = pd.to_datetime(df.response_date)
print (df)
summary response_date
0 8.0 2009-02-19 10:54:00
1 11.0 2009-02-24 11:23:45
2 14.0 2009-03-03 17:55:07
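The rows in the question mix +0000 and +0100 offsets. Depending on the pandas version, parsing mixed offsets may yield an object column, a warning, or an error unless you say how to handle them. Below is a minimal sketch, assuming it is acceptable to normalize everything to UTC; the `utc=True` choice and the two toy rows are assumptions for illustration, not part of the answer above.

```python
import pandas as pd

# Toy rows copied from the question; one uses +0000, the other +0100.
df = pd.DataFrame({
    "summary": [8.0, 22.0],
    "response_date": [
        {"$date": "2009-02-19T10:54:00.000+0000"},
        {"$date": "2009-04-02T16:57:31.000+0100"},
    ],
})

# Expand the dicts into a one-column frame, then pick the "$date" column.
dates = pd.DataFrame(df["response_date"].tolist(), index=df.index)["$date"]

# utc=True converts every offset to UTC, giving one timezone-aware dtype.
df["response_date"] = pd.to_datetime(dates, utc=True)
print(df)
```

Passing `index=df.index` keeps the extracted values aligned with the original rows even if the frame does not have a default 0..n-1 index.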
If the type is string, use split and strip:
print(type(df.loc[0, 'response_date']))
<class 'str'>
df.response_date = df.response_date.str.split().str[1].str.strip("'u}")
df.response_date = pd.to_datetime(df.response_date)
print (df)
summary response_date
0 8.0 2009-02-19 10:54:00
1 11.0 2009-02-24 11:23:45
2 14.0 2009-03-03 17:55:07
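The split/strip chain depends on the exact spacing and quoting of the string. As an alternative sketch (not the method above), a regular expression can pull the timestamp out directly. The pattern below is an assumption about the string layout shown in the question, and `utc=True` is again an illustrative choice.

```python
import pandas as pd

# Toy rows: the dicts from the question, stored as plain strings.
df = pd.DataFrame({
    "summary": [8.0, 11.0],
    "response_date": [
        "{u'$date': u'2009-02-19T10:54:00.000+0000'}",
        "{u'$date': u'2009-02-24T11:23:45.000+0000'}",
    ],
})

# Capture everything from the four-digit year up to the closing quote.
pattern = r"(\d{4}-\d{2}-\d{2}T[^']+)"
iso = df["response_date"].str.extract(pattern, expand=False)

df["response_date"] = pd.to_datetime(iso, utc=True)
print(df)
```

Rows that do not match the pattern come back as NaN from `str.extract` and therefore end up as NaT after conversion.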
EDIT by comment:
Two possible solutions:
The first is fillna with an empty dict:
df.closed_date = df.closed_date.fillna(pd.Series([{}]))
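One caveat: when fillna is given a Series, it aligns on index labels, so a one-element Series such as `pd.Series([{}])` can only fill a missing value in the row labelled 0. A hedged sketch of a per-row alternative follows, assuming every non-missing cell is a dict with a `$date` key; the lambda, the toy data, and `utc=True` are illustrative and not taken from the answer above.

```python
import numpy as np
import pandas as pd

# Toy column with several missing rows before the first dict.
df = pd.DataFrame({
    "closed_date": [
        np.nan,
        np.nan,
        {"$date": "2011-11-16T15:20:00.000+0000"},
    ],
})

# Pull "$date" out of each dict; anything that is not a dict becomes None.
extracted = df["closed_date"].map(
    lambda v: v.get("$date") if isinstance(v, dict) else None
)

# Missing entries turn into NaT during the conversion.
df["closed_date"] = pd.to_datetime(extracted, utc=True)
print(df)
```

This skips the intermediate empty dicts entirely, so no placeholder value has to be chosen for the missing rows.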
The other is boolean indexing:
import numpy as np
import pandas as pd
df = pd.DataFrame({'summary': [19.0, -1.0, -1.0],
                   'response_date': ['2011-10-08T09:00:00.000+0100',
                                     '2011-11-09T17:20:00.000+0000',
                                     '2011-11-16T17:20:00.000+0000'],
                   'closed_date': [np.nan,
                                   {u'$date': u'2011-11-16T15:20:00.000+0000'},
                                   {u'$date': u'2011-11-23T15:20:00.000+0000'}]},
                  columns=['summary', 'response_date', 'closed_date'])
print (df)
summary response_date \
0 19.0 2011-10-08T09:00:00.000+0100
1 -1.0 2011-11-09T17:20:00.000+0000
2 -1.0 2011-11-16T17:20:00.000+0000
closed_date
0 NaN
1 {'$date': '2011-11-16T15:20:00.000+0000'}
2 {'$date': '2011-11-23T15:20:00.000+0000'}
a = df.loc[df.closed_date.notnull(), 'closed_date']
print (a)
1 {'$date': '2011-11-16T15:20:00.000+0000'}
2 {'$date': '2011-11-23T15:20:00.000+0000'}
Name: closed_date, dtype: object
df['closed_date'] = pd.DataFrame(a.values.tolist(), index=a.index)
df.closed_date = pd.to_datetime(df.closed_date)
print (df)
summary response_date closed_date
0 19.0 2011-10-08T09:00:00.000+0100 NaT
1 -1.0 2011-11-09T17:20:00.000+0000 2011-11-16 15:20:00
2 -1.0 2011-11-16T17:20:00.000+0000 2011-11-23 15:20:00
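As for why the masked assignment in UPDATE 2 leaves df unchanged: boolean indexing such as `df[~indx_nan_closed]` returns a new DataFrame, so assigning to a column of that temporary object never reaches df. In addition, a frame built from `values.tolist()` gets a fresh 0..n-1 index unless `index=` is passed, so it would not line up with the masked rows. A minimal sketch of writing through the mask with a single `.loc` call follows; the `map` with a lambda and `utc=True` are illustrative assumptions.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "summary": [19.0, -1.0, -1.0],
    "closed_date": [
        np.nan,
        {"$date": "2011-11-16T15:20:00.000+0000"},
        {"$date": "2011-11-23T15:20:00.000+0000"},
    ],
})

mask = df["closed_date"].notnull()

# Selecting rows and column in one .loc call assigns into df itself,
# and the right-hand side keeps the original index labels for alignment.
df.loc[mask, "closed_date"] = df.loc[mask, "closed_date"].map(lambda d: d["$date"])

# The remaining NaN becomes NaT during the conversion.
df["closed_date"] = pd.to_datetime(df["closed_date"], utc=True)
print(df)
```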