
Pandas DataFrame casting to timedelta fails with loc

I've got a little bit of a weird situation, and I don't understand why it works in one situation and not the other.

I'm trying to cast a column under a column MultiIndex from timedelta64[ns] to timedelta64[s], and the rows also have a MultiIndex. If tuple is the column I want, i.e. (level_0, level_1):

  • it works with df[tuple] = df[tuple].astype('timedelta64[s]')

  • it doesn't work with df.loc[:, tuple] = df.loc[:, tuple].astype('timedelta64[s]')


Here is some sample data (csv):

Level_0,,,Respondent,Respondent,Respondent,OtherCat,OtherCat
Level_1,,,Something,StartDate,EndDate,Yes/No,SomethingElse
Region,Site,RespondentID,,,,,
Region_1,Site_1,3987227376,A,5/25/2015 10:59,5/25/2015 11:22,Yes,
Region_1,Site_1,3980680971,A,5/21/2015 9:40,5/21/2015 9:52,Yes,Yes
Region_1,Site_2,3977723249,A,5/20/2015 8:27,5/20/2015 8:41,Yes,
Region_1,Site_2,3977723089,A,5/20/2015 8:33,5/20/2015 9:09,Yes,No

Load it with:

In [1]: df = pd.read_csv(header=[0,1], index_col=[0,1,2])
        df

Out[1]: 

(image: sample)
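For anyone who wants to reproduce this without a file on disk, here is a minimal sketch that feeds the sample data above to read_csv through io.StringIO. The use of StringIO and the name csv_text are assumptions made for illustration; the question does not say which path was passed to read_csv.

```python
import io

import pandas as pd

# Sample data copied from above; csv_text is a hypothetical name.
csv_text = """Level_0,,,Respondent,Respondent,Respondent,OtherCat,OtherCat
Level_1,,,Something,StartDate,EndDate,Yes/No,SomethingElse
Region,Site,RespondentID,,,,,
Region_1,Site_1,3987227376,A,5/25/2015 10:59,5/25/2015 11:22,Yes,
Region_1,Site_1,3980680971,A,5/21/2015 9:40,5/21/2015 9:52,Yes,Yes
Region_1,Site_2,3977723249,A,5/20/2015 8:27,5/20/2015 8:41,Yes,
Region_1,Site_2,3977723089,A,5/20/2015 8:33,5/20/2015 9:09,Yes,No
"""

# The two header rows become the column MultiIndex; the first three
# columns become the row MultiIndex.
df = pd.read_csv(io.StringIO(csv_text), header=[0, 1], index_col=[0, 1, 2])

print(df.index.names)
print(df.columns.tolist())
```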

I want to create a column "Duration" (and then one called "DurationMinutes" dividing Duration by 60).

I start by casting the dates to datetime:

In [2]: df.loc[:,('Respondent','StartDate')] = pd.to_datetime(df.loc[:,('Respondent','StartDate')])
        df.loc[:,('Respondent','EndDate')] = pd.to_datetime(df.loc[:,('Respondent','EndDate')])
        df.loc[:,('Respondent','Duration')] = df.loc[:,('Respondent','EndDate')] - df.loc[:,('Respondent','StartDate')]

This is where I no longer understand what's going on. I want to convert the column to timedelta64[s] because that is the unit I need. If I simply display the result of astype('timedelta64[s]'), it works like a charm:

In [3]: df.loc[:,('Respondent','Duration')].astype('timedelta64[s]')
Out[3]: 
Region    Site    RespondentID
Region_1  Site_1  3987227376      1380
                  3980680971       720
          Site_2  3977723249       840
                  3977723089      2160
Name: (Respondent, Duration), dtype: float64

But if I assign, then show the column, it fails:

In [4]: df.loc[:,('Respondent','Duration')] = df.loc[:,('Respondent','Duration')].astype('timedelta64[s]')
        df.loc[:,('Respondent','Duration')]
Out[4]: 
Region    Site    RespondentID
Region_1  Site_1  3987227376     00:00:00.000001
                  3980680971     00:00:00.000000
          Site_2  3977723249     00:00:00.000000
                  3977723089     00:00:00.000002
Name: (Respondent, Duration), dtype: timedelta64[ns]

Weirdly enough, if I do this, it works:

In [5]: df[('Respondent','Duration')] = df[('Respondent','Duration')].astype('timedelta64[s]')
        df.loc[:,('Respondent','Duration')]
Out[5]:
Region    Site    RespondentID
Region_1  Site_1  3987227376      1380
                  3980680971       720
          Site_2  3977723249       840
                  3977723089      2160
Name: (Respondent, Duration), dtype: float64

Another strange thing: if I filter for one site and drop the Region level, so that I end up with a single-level index, it works:

In [6]:
Survey = 'Site_1'
df = df.xs(Survey, level='Site').copy()

# Drop the 'Region' from index
df.index = df.index.droplevel(level='Region')

df.loc[:,('Respondent','StartDate')] = pd.to_datetime(df.loc[:,('Respondent','StartDate')])
df.loc[:,('Respondent','EndDate')] = pd.to_datetime(df.loc[:,('Respondent','EndDate')])
df.loc[:,('Respondent','Duration')] = df.loc[:,('Respondent','EndDate')] - df.loc[:,('Respondent','StartDate')]

# This works fine
df.loc[:,('Respondent','Duration')] = df.loc[:,('Respondent','Duration')].astype('timedelta64[s]')

# Display
df.loc[:,('Respondent','Duration')]

Out[6]:
RespondentID
3987227376    1380
3980680971     720
Name: (Respondent, Duration), dtype: float64

Clearly I'm missing something about why df.loc[:,tuple] behaves differently from df[tuple].

Can someone shed some light please?


Python 2.7.9, pandas 0.16.2

This was a bug. I just fixed it here; the fix will be in 0.17.0.

The gist is this. When you do something like df.loc[:,column] = value, it is treated exactly the same as df[[column]] = value. This means that type coercion is independent of what the column WAS. Contrast this with df.loc[indexer,column] = value, where you are only partially setting a column. Here the new value AND the existing dtype of the column both matter.
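To illustrate the distinction, here is a minimal sketch on a toy frame. The frame and the names full and partial are made up for illustration. It uses plain column assignment for the full replacement, since that is the form the full-slice .loc assignment is described as being treated like; how df.loc[:, column] = value itself behaves may depend on the pandas version.

```python
import pandas as pd

# Hypothetical toy frame with a single float64 column.
full = pd.DataFrame({"x": [1.5, 2.5, 3.5]})
partial = full.copy()

# Full replacement: the old float64 dtype is irrelevant, the column
# simply takes the dtype of the new values.
full["x"] = [1, 2, 3]

# Partial set: only rows 0 and 1 are written, so the existing
# float64 dtype is kept and the integer 7 is stored as 7.0.
partial.loc[[0, 1], "x"] = 7

print(full["x"].dtype)
print(partial["x"].dtype)
```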

The bug was that when the frame has a multi-index, even though the multi-index was a full index (i.e. it encompassed the full length of values in the frame), it wasn't taking the correct path.

So the bottom line is that these cases should be (and will be) the same.
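On a version without the fix, one way to sidestep the .loc path is the plain column assignment that the question already shows working. The following is only a sketch under assumptions: the frame is rebuilt by hand from the sample rows, and it uses Series.dt.total_seconds() to get float seconds instead of astype('timedelta64[s]'), which assumes a pandas version that provides that accessor.

```python
import pandas as pd

# Hypothetical reconstruction of the date columns from the question.
index = pd.MultiIndex.from_tuples(
    [
        ("Region_1", "Site_1", 3987227376),
        ("Region_1", "Site_1", 3980680971),
        ("Region_1", "Site_2", 3977723249),
        ("Region_1", "Site_2", 3977723089),
    ],
    names=["Region", "Site", "RespondentID"],
)
columns = pd.MultiIndex.from_tuples(
    [("Respondent", "StartDate"), ("Respondent", "EndDate")]
)
df = pd.DataFrame(
    [
        ["5/25/2015 10:59", "5/25/2015 11:22"],
        ["5/21/2015 9:40", "5/21/2015 9:52"],
        ["5/20/2015 8:27", "5/20/2015 8:41"],
        ["5/20/2015 8:33", "5/20/2015 9:09"],
    ],
    index=index,
    columns=columns,
)

fmt = "%m/%d/%Y %H:%M"
start = pd.to_datetime(df[("Respondent", "StartDate")], format=fmt)
end = pd.to_datetime(df[("Respondent", "EndDate")], format=fmt)

# Plain column assignment replaces the whole column, so the result
# dtype comes from the new values (float seconds here).
df[("Respondent", "Duration")] = (end - start).dt.total_seconds()
df[("Respondent", "DurationMinutes")] = df[("Respondent", "Duration")] / 60

print(df[("Respondent", "Duration")])
```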
