简体   繁体   English

加入具有不同时区的 Pandas DataFrames 时可能出现的错误

[英]Possible bug when joining pandas DataFrames with different Timezones

I am trying to join two pandas dataframes with the pd.DataFrame.join function, but there is a problem when I try to change the datetime of an existing index.我正在尝试使用 pd.DataFrame.join 函数连接两个 Pandas 数据帧,但是当我尝试更改现有索引的日期时间时出现问题。 I used:我用了:

import pandas as pd
import pytz
import numpy as np

# Creating date ranges
sim_date = pd.date_range('1980-01-01', '1980-12-31', freq='1H', tz='UTC')
obs_date = pd.date_range('1980-01-01', '1980-12-31', freq='1H')

# Creating DataFrames with the time ranges as indices
sim = pd.DataFrame(np.random.rand(len(interim_date)), index=interim_date, columns=['Sim'])
obs = pd.DataFrame(np.random.rand(len(obs_date)), index=obs_date, columns=['Obs'])

# Changing the timezone of the observed data index
obs.index = obs.index.tz_localize('UTC').tz_convert('America/Phoenix')

# Printing the result of the join
print(pd.DataFrame.join(sim, obs).dropna())

Which yields:其中产生:

                                Sim       Obs
1980-01-01 00:00:00+00:00  0.844345  0.117649
1980-01-01 01:00:00+00:00  0.505349  0.755907
1980-01-01 02:00:00+00:00  0.799555  0.169102
1980-01-01 03:00:00+00:00  0.194750  0.704400
1980-01-01 04:00:00+00:00  0.459079  0.241803
1980-01-01 05:00:00+00:00  0.496936  0.726264
1980-01-01 06:00:00+00:00  0.515039  0.989569
1980-01-01 07:00:00+00:00  0.271105  0.488859
1980-01-01 08:00:00+00:00  0.545269  0.434904
1980-01-01 09:00:00+00:00  0.817365  0.067979
1980-01-01 10:00:00+00:00  0.051024  0.068993
1980-01-01 11:00:00+00:00  0.170346  0.510406
1980-01-01 12:00:00+00:00  0.518609  0.583602
1980-01-01 13:00:00+00:00  0.725753  0.402805
1980-01-01 14:00:00+00:00  0.134059  0.879183
1980-01-01 15:00:00+00:00  0.304070  0.773884
1980-01-01 16:00:00+00:00  0.742448  0.158367
1980-01-01 17:00:00+00:00  0.539499  0.067725
1980-01-01 18:00:00+00:00  0.349432  0.027337
1980-01-01 19:00:00+00:00  0.549015  0.078190
1980-01-01 20:00:00+00:00  0.089871  0.878931
1980-01-01 21:00:00+00:00  0.100849  0.359007
1980-01-01 22:00:00+00:00  0.290280  0.168759
1980-01-01 23:00:00+00:00  0.074420  0.881724
1980-01-02 00:00:00+00:00  0.091413  0.820616

but when I use this:但是当我使用这个时:

import pandas as pd
import pytz
import numpy as np

interim_date = pd.date_range('1980-01-01', '1980-01-02', freq='1H', tz='UTC')
obs_date = pd.date_range('1980-01-01', '1980-01-02', freq='1H', tz='America/Phoenix')

sim = pd.DataFrame(np.random.rand(len(interim_date)), index=interim_date, columns=['Sim'])
obs = pd.DataFrame(np.random.rand(len(obs_date)), index=obs_date, columns=['Obs'])

print(pd.DataFrame.join(sim, obs).dropna())

it yields:它产生:

                                Sim       Obs
1980-01-01 07:00:00+00:00  0.894766  0.509333
1980-01-01 08:00:00+00:00  0.805764  0.564251
1980-01-01 09:00:00+00:00  0.996807  0.856853
1980-01-01 10:00:00+00:00  0.494817  0.088286
1980-01-01 11:00:00+00:00  0.716468  0.947045
1980-01-01 12:00:00+00:00  0.808407  0.332764
1980-01-01 13:00:00+00:00  0.554688  0.959215
1980-01-01 14:00:00+00:00  0.389542  0.462384
1980-01-01 15:00:00+00:00  0.039566  0.850724
1980-01-01 16:00:00+00:00  0.634998  0.097579
1980-01-01 17:00:00+00:00  0.169957  0.390812
1980-01-01 18:00:00+00:00  0.113913  0.519487
1980-01-01 19:00:00+00:00  0.521354  0.260055
1980-01-01 20:00:00+00:00  0.910717  0.693063
1980-01-01 21:00:00+00:00  0.907878  0.190714
1980-01-01 22:00:00+00:00  0.625534  0.048584
1980-01-01 23:00:00+00:00  0.926966  0.815481
1980-01-02 00:00:00+00:00  0.841386  0.573255

It seems as though both methods should yield the same result, but it seems like in the first case the observed DataFrame index does not change, even though when I print it out it says that it changed... Any suggestions?似乎两种方法都应该产生相同的结果,但似乎在第一种情况下观察到的 DataFrame 索引没有改变,即使当我打印出来它说它改变了......有什么建议吗?

Your two methods of construction are actually different.你的两种构建方法实际上是不同的。 For example, look at the first element of each index:比如看每个索引的第一个元素:

In [2]: obs_date1 = pd.date_range('1980-01-01', '1980-12-31', freq='1H') \
                      .tz_localize('UTC') \
                      .tz_convert('America/Phoenix')

In [3]: obs_date2 = pd.date_range('1980-01-01', '1980-12-31', freq='1H', 
                                  tz='America/Phoenix')

In [4]: obs_date1[0]
Out[4]: Timestamp('1979-12-31 17:00:00-0700', tz='America/Phoenix', freq='H')

In [5]: obs_date2[0]
Out[5]: Timestamp('1980-01-01 00:00:00-0700', tz='America/Phoenix', freq='H')

By doing .tz_localize('UTC') , you're basically appending the UTC timezone to each timestamp.通过执行.tz_localize('UTC') ,您基本上是将 UTC 时区附加到每个时间戳。 Then by using tz_convert('America/Phoenix') you're shifting the UTC timestamps back seven hours to the America/Phoenix timezone.然后通过使用tz_convert('America/Phoenix')将 UTC 时间戳移回美洲/凤凰时区七小时。 When you use this method and join against the UTC timestamps, this index get shifted back to UTC for alignment, and both indexes exactly match, hence nothing getting dropped in your join .当您使用此方法并根据 UTC 时间戳进行join时,该索引将移回 UTC 以进行对齐,并且两个索引完全匹配,因此您的join不会丢失任何内容。

I think what you want for the first option is to just directly use tz_localize('America/Phoenix') :我认为你想要的第一个选项是直接使用tz_localize('America/Phoenix')

In [6]: obs_date3 = pd.date_range('1980-01-01', '1980-12-31', freq='1H') \
   ...:               .tz_localize('America/Phoenix')

In [7]: obs_date3[0]
Out[7]: Timestamp('1980-01-01 00:00:00-0700', tz='America/Phoenix', freq='H')

声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM