Pandas DataFrame with date and time in JSON format

Question

I'm importing data from a .json file into a pandas DataFrame and the result is a bit broken:

              >> print df
              summary                                response_date
                  8.0  {u'$date': u'2009-02-19T10:54:00.000+0000'}
                 11.0  {u'$date': u'2009-02-24T11:23:45.000+0000'}
                 14.0  {u'$date': u'2009-03-03T17:55:07.000+0000'}
                 16.0  {u'$date': u'2009-03-10T12:23:04.000+0000'}
                 19.0  {u'$date': u'2009-03-17T17:19:55.000+0000'}
                 13.0  {u'$date': u'2009-03-25T15:10:52.000+0000'}
                 22.0  {u'$date': u'2009-04-02T16:57:31.000+0100'}
                 15.0  {u'$date': u'2009-04-08T22:29:09.000+0100'}
                 20.0  {u'$date': u'2009-04-16T18:14:20.000+0100'}
                 13.0  {u'$date': u'2009-04-29T10:47:06.000+0100'}
                 15.0  {u'$date': u'2009-05-06T13:45:45.000+0100'}
                 20.0  {u'$date': u'2009-05-26T10:41:52.000+0100'}

How can I get rid of '$date' and the other clutter to create a normal column with date and time? To convert from ISO 8601 format I normally use:

df.response_date = pd.to_datetime(df.response_date)

UPDATE 1

       summary                 response_date                                  closed_date                                    open_date
          24.0  2011-10-15T00:00:00.000+0100                                          NaN                                          NaN
          24.0  2011-11-24T09:00:00.000+0000                                          NaN                                          NaN
          19.0  2011-10-01T09:00:00.000+0100                                          NaN                                          NaN
          25.0  2011-10-29T09:00:00.000+0100                                          NaN                                          NaN
          19.0  2011-10-08T09:00:00.000+0100                                          NaN                                          NaN
          -1.0  2011-11-09T17:20:00.000+0000  {u'$date': u'2011-11-16T15:20:00.000+0000'}  {u'$date': u'2011-11-09T15:20:00.000+0000'}
          -1.0  2011-11-16T17:20:00.000+0000  {u'$date': u'2011-11-23T15:20:00.000+0000'}  {u'$date': u'2011-11-16T15:20:00.000+0000'}
          -1.0  2011-11-23T17:20:00.000+0000  {u'$date': u'2011-11-30T15:20:00.000+0000'}  {u'$date': u'2011-11-23T15:20:00.000+0000'}
          -1.0  2011-11-30T17:20:00.000+0000  {u'$date': u'2011-12-07T15:20:00.000+0000'}  {u'$date': u'2011-11-30T15:20:00.000+0000'}

So, the

>> df.response_date = pd.DataFrame(df.response_date.values.tolist())

worked perfectly, but the other columns contain NaN values, and imputing with "-1" doesn't help.

>> print type(df.ix[0,'scheduleClosedAt'])
<type 'int'>

UPDATE 2

Why does this (masking) method not work?

>> df.reset_index(inplace=True)
>> indx_nan_closed = df.closed_date.isnull()
>> df[~indx_nan_closed].closed_date = pd.DataFrame(df[~indx_nan_closed].closed_date.values.tolist())

This line is equivalent to the one above, but with a boolean mask, so that the method is applied only to non-NaN values. The result, however, is that my data frame "df" remains unchanged, which is quite strange.

Any thoughts?

Answer 1

If the values are of type dict, you can convert the response_date column to a list via values and pass it to the DataFrame constructor:

print (type(df.loc[0,'response_date']))
<class 'dict'>

df.response_date = pd.DataFrame(df.response_date.values.tolist())
df.response_date = pd.to_datetime(df.response_date)
print (df)
   summary       response_date
0      8.0 2009-02-19 10:54:00
1     11.0 2009-02-24 11:23:45
2     14.0 2009-03-03 17:55:07
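
The question's sample mixes +0000 and +0100 offsets, which the output above does not show. As a minimal sketch (not part of the original answer), the ISO 8601 string can also be pulled out of each dict directly and parsed with utc=True so that all rows end up in one timezone. The three sample rows are copied from the question; the variable name iso_strings is illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'summary': [8.0, 11.0, 22.0],
    'response_date': [{'$date': '2009-02-19T10:54:00.000+0000'},
                      {'$date': '2009-02-24T11:23:45.000+0000'},
                      {'$date': '2009-04-02T16:57:31.000+0100'}],
})

# Pull the ISO 8601 string out of each {'$date': ...} wrapper.
iso_strings = df['response_date'].map(lambda d: d['$date'])

# The rows carry different UTC offsets; utc=True converts them all to UTC
# and yields a single timezone-aware datetime column.
df['response_date'] = pd.to_datetime(iso_strings, utc=True)
print(df)
```

This assumes every entry is a dict with a '$date' key; a NaN in the column would make the lambda raise a TypeError.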

If the type is string, use split and strip:

print (type(df.loc[0,'response_date']))
<class 'str'>

df.response_date = df.response_date.str.split().str[1].str.strip("'u}")
df.response_date = pd.to_datetime(df.response_date)

print (df)
   summary       response_date
0      8.0 2009-02-19 10:54:00
1     11.0 2009-02-24 11:23:45
2     14.0 2009-03-03 17:55:07
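
The split/strip chain depends on the exact spacing and quoting of the string. A possible alternative, sketched here under the assumption that each string is a valid Python dict literal (as the printed values in the question suggest), is to parse it with ast.literal_eval from the standard library and then read the '$date' key. The name parsed is illustrative.

```python
import ast
import pandas as pd

df = pd.DataFrame({
    'summary': [8.0, 11.0],
    'response_date': ["{u'$date': u'2009-02-19T10:54:00.000+0000'}",
                      "{u'$date': u'2009-02-24T11:23:45.000+0000'}"],
})

# literal_eval safely turns the textual dict back into a real dict
# (the u'' prefix is still accepted by Python 3).
parsed = df['response_date'].map(ast.literal_eval)

# Extract the ISO 8601 string and parse it as a UTC timestamp.
df['response_date'] = pd.to_datetime(parsed.map(lambda d: d['$date']), utc=True)
print(df)
```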

EDIT (in response to a comment):

Two possible solutions:

The first is to fill the NaN values with an empty dict using fillna:

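# fillna aligns the filler Series on the index, so it needs one {} per row
df.closed_date = df.closed_date.fillna(pd.Series([{}] * len(df), index=df.index))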
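
The empty-dict idea can be sketched end to end without relying on fillna, by swapping every non-dict entry for {} in a list comprehension before expanding. The sample rows are taken from the question; the names filled and expanded are illustrative, and this is one possible approach rather than the answer's exact method.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'summary': [19.0, -1.0, -1.0],
    'closed_date': [np.nan,
                    {'$date': '2011-11-16T15:20:00.000+0000'},
                    {'$date': '2011-11-23T15:20:00.000+0000'}],
})

# Replace anything that is not a dict (here: NaN) with an empty dict.
filled = [v if isinstance(v, dict) else {} for v in df['closed_date']]

# Expanding the list of dicts gives a '$date' column; empty dicts become NaN.
expanded = pd.DataFrame(filled, index=df.index)

# NaN is parsed as NaT; utc=True keeps one timezone-aware column.
df['closed_date'] = pd.to_datetime(expanded['$date'], utc=True)
print(df)
```

If every entry is missing, expanded has no '$date' column at all and the lookup raises a KeyError, so that case would need separate handling.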

The other is boolean indexing:

import numpy as np
import pandas as pd

df = pd.DataFrame({'summary':[19.0, -1.0,-1.0],
                   'response_date':['2011-10-08T09:00:00.000+0100','2011-11-09T17:20:00.000+0000','2011-11-16T17:20:00.000+0000'],
              'closed_date':[np.nan, {u'$date': u'2011-11-16T15:20:00.000+0000'}, {u'$date': u'2011-11-23T15:20:00.000+0000'}]},
                   columns=['summary','response_date','closed_date'])

print (df)
   summary                 response_date  \
0     19.0  2011-10-08T09:00:00.000+0100   
1     -1.0  2011-11-09T17:20:00.000+0000   
2     -1.0  2011-11-16T17:20:00.000+0000   

                                 closed_date  
0                                        NaN  
1  {'$date': '2011-11-16T15:20:00.000+0000'}  
2  {'$date': '2011-11-23T15:20:00.000+0000'}

a = df.loc[df.closed_date.notnull(), 'closed_date']
print (a)
1    {'$date': '2011-11-16T15:20:00.000+0000'}
2    {'$date': '2011-11-23T15:20:00.000+0000'}
Name: closed_date, dtype: object

df['closed_date'] = pd.DataFrame(a.values.tolist(), index=a.index)
df.closed_date = pd.to_datetime(df.closed_date)
print (df)

   summary                 response_date         closed_date
0     19.0  2011-10-08T09:00:00.000+0100                 NaT
1     -1.0  2011-11-09T17:20:00.000+0000 2011-11-16 15:20:00
2     -1.0  2011-11-16T17:20:00.000+0000 2011-11-23 15:20:00
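
As for why the masked assignment in the question leaves df unchanged: df[~indx_nan_closed] builds a new DataFrame, so assigning to its closed_date attribute only modifies that temporary object. A minimal sketch of the same masking idea, written with a single .loc call so the write reaches df itself (the name mask is illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'summary': [19.0, -1.0, -1.0],
    'closed_date': [np.nan,
                    {'$date': '2011-11-16T15:20:00.000+0000'},
                    {'$date': '2011-11-23T15:20:00.000+0000'}],
})

mask = df['closed_date'].notnull()

# One .loc call selects rows and column together, so the assignment
# is applied to df rather than to a temporary copy.
df.loc[mask, 'closed_date'] = df.loc[mask, 'closed_date'].map(lambda d: d['$date'])

# The untouched NaN rows are parsed as NaT.
df['closed_date'] = pd.to_datetime(df['closed_date'], utc=True)
print(df)
```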

Answer 1: 2 votes, accepted, 2016-09-23 11:10:05
