
How to convert columns into one datetime column in pandas?

I have a dataframe where the first 3 columns are 'MONTH', 'DAY', 'YEAR'.

Each column holds an integer. Is there a Pythonic way to convert all three columns into datetimes while they are in the dataframe?

From:

M    D    Y    Apples   Oranges
5    6  1990      12        3
5    7  1990      14        4
5    8  1990      15       34
5    9  1990      23       21

into:

Datetimes    Apples   Oranges
1990-5-6        12        3
1990-5-7        14        4
1990-5-8        15       34
1990-5-9        23       21

From pandas 0.18.1 onward you can use to_datetime, but:

  • The columns have to be named year, month, day, hour, minute and second.
  • At a minimum, the year, month and day columns are required.

Sample:

import pandas as pd

df = pd.DataFrame({'year': [2015, 2016],
                   'month': [2, 3],
                   'day': [4, 5],
                   'hour': [2, 3],
                   'minute': [10, 30],
                   'second': [21, 25]})

print(df)
   year  month  day  hour  minute  second
0  2015      2    4     2      10      21
1  2016      3    5     3      30      25

print(pd.to_datetime(df[['year', 'month', 'day']]))
0   2015-02-04
1   2016-03-05
dtype: datetime64[ns]

print(pd.to_datetime(df[['year', 'month', 'day', 'hour']]))
0   2015-02-04 02:00:00
1   2016-03-05 03:00:00
dtype: datetime64[ns]

print(pd.to_datetime(df[['year', 'month', 'day', 'hour', 'minute']]))
0   2015-02-04 02:10:00
1   2016-03-05 03:30:00
dtype: datetime64[ns]

print(pd.to_datetime(df))
0   2015-02-04 02:10:21
1   2016-03-05 03:30:25
dtype: datetime64[ns]
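
For the frame in the question, whose columns are named M, D and Y, one way to meet the naming requirement is to rename a copy of those columns first. This is a minimal sketch that assumes the question's sample data; the variable names parts and dates are illustrative only:

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({'M': [5, 5, 5, 5],
                   'D': [6, 7, 8, 9],
                   'Y': [1990, 1990, 1990, 1990],
                   'Apples': [12, 14, 15, 23],
                   'Oranges': [3, 4, 34, 21]})

# Select the date parts and give them the names to_datetime expects.
parts = df[['Y', 'M', 'D']].rename(columns={'Y': 'year', 'M': 'month', 'D': 'day'})
dates = pd.to_datetime(parts)
print(dates)
```

The rename acts on a selected copy, so the original frame keeps its M, D and Y columns.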

Another solution is to convert to a dictionary:

print(df)
   M  D     Y  Apples  Oranges
0  5  6  1990      12        3
1  5  7  1990      14        4
2  5  8  1990      15       34
3  5  9  1990      23       21

print(pd.to_datetime(dict(year=df.Y, month=df.M, day=df.D)))
0   1990-05-06
1   1990-05-07
2   1990-05-08
3   1990-05-09
dtype: datetime64[ns]
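
To reach the layout asked for in the question, the resulting Series can take the place of the three original columns. A sketch under the assumption that the new column should be called Datetimes, as in the question's desired output; result is an illustrative name:

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({'M': [5, 5, 5, 5],
                   'D': [6, 7, 8, 9],
                   'Y': [1990, 1990, 1990, 1990],
                   'Apples': [12, 14, 15, 23],
                   'Oranges': [3, 4, 34, 21]})

# Build the datetime column from the three integer columns.
dates = pd.to_datetime(dict(year=df['Y'], month=df['M'], day=df['D']))

# Drop the date parts and put the new column first.
result = df.drop(columns=['M', 'D', 'Y'])
result.insert(0, 'Datetimes', dates)
print(result)
```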

In 0.13 (coming very soon), this is heavily optimized and quite fast (it is still pretty fast in 0.12); both are orders of magnitude faster than looping:

In [3]: df
Out[3]: 
   M  D     Y  Apples  Oranges
0  5  6  1990      12        3
1  5  7  1990      14        4
2  5  8  1990      15       34
3  5  9  1990      23       21

In [4]: df.dtypes
Out[4]: 
M          int64
D          int64
Y          int64
Apples     int64
Oranges    int64
dtype: object

# in 0.12, use this
In [5]: pd.to_datetime((df.Y*10000+df.M*100+df.D).apply(str),format='%Y%m%d')

# in 0.13 the above or this will work
In [5]: pd.to_datetime(df.Y*10000+df.M*100+df.D,format='%Y%m%d')
Out[5]: 
0   1990-05-06 00:00:00
1   1990-05-07 00:00:00
2   1990-05-08 00:00:00
3   1990-05-09 00:00:00
dtype: datetime64[ns]
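
The arithmetic works because Y*10000 + M*100 + D packs the three integers into a single YYYYMMDD number, so month and day always occupy exactly two digits each. A small sketch of the intermediate values, converting to strings first as in the 0.12 variant; the sample values are made up for illustration:

```python
import pandas as pd

df = pd.DataFrame({'M': [5, 11], 'D': [6, 25], 'Y': [1990, 1991]})

# 1990*10000 + 5*100 + 6 == 19900506, i.e. a zero-padded YYYYMMDD number.
encoded = df['Y'] * 10000 + df['M'] * 100 + df['D']
print(encoded.tolist())

# The fixed-width strings can then be parsed with an explicit format.
dates = pd.to_datetime(encoded.astype(str), format='%Y%m%d')
print(dates)
```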

Here is an alternative which uses NumPy datetime64 and timedelta64 arithmetic. It appears to be a bit faster for small DataFrames and much faster for larger DataFrames:

import numpy as np
import pandas as pd

df = pd.DataFrame({'M':[1,2,3,4], 'D':[6,7,8,9], 'Y':[1990,1991,1992,1993]})
#    D  M     Y
# 0  6  1  1990
# 1  7  2  1991
# 2  8  3  1992
# 3  9  4  1993

y = np.array(df['Y']-1970, dtype='<M8[Y]')
m = np.array(df['M']-1, dtype='<m8[M]')
d = np.array(df['D']-1, dtype='<m8[D]')
dates2 = pd.Series(y+m+d)
# 0   1990-01-06
# 1   1991-02-07
# 2   1992-03-08
# 3   1993-04-09
# dtype: datetime64[ns]

In [214]: df = pd.concat([df]*1000)

In [215]: %timeit pd.to_datetime((df['Y']*10000+df['M']*100+df['D']).astype('int'), format='%Y%m%d')
100 loops, best of 3: 4.87 ms per loop

In [216]: %timeit pd.Series(np.array(df['Y']-1970, dtype='<M8[Y]')+np.array(df['M']-1, dtype='<m8[M]')+np.array(df['D']-1, dtype='<m8[D]'))
1000 loops, best of 3: 839 µs per loop
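
The offsets in that snippet follow from how NumPy stores these types: a datetime64[Y] value counts years since 1970, and the month and day timedeltas are added on top of 1 January, hence the -1970 and the two -1 terms. A scalar sketch of the same arithmetic for 6 May 1990, with illustrative variable names:

```python
import numpy as np

# 1990 is 20 years after the 1970 epoch.
year = np.datetime64(1990 - 1970, 'Y')

# May is 4 months after January; the 6th is 5 days after the 1st.
month_offset = np.timedelta64(5 - 1, 'M')
day_offset = np.timedelta64(6 - 1, 'D')

# Adding a finer-grained timedelta promotes the result's unit: Y -> M -> D.
date = year + month_offset + day_offset
print(date)
```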

Here's a helper function to make this easier to use:

def combine64(years, months=1, days=1, weeks=None, hours=None, minutes=None,
              seconds=None, milliseconds=None, microseconds=None, nanoseconds=None):
    years = np.asarray(years) - 1970
    months = np.asarray(months) - 1
    days = np.asarray(days) - 1
    types = ('<M8[Y]', '<m8[M]', '<m8[D]', '<m8[W]', '<m8[h]',
             '<m8[m]', '<m8[s]', '<m8[ms]', '<m8[us]', '<m8[ns]')
    vals = (years, months, days, weeks, hours, minutes, seconds,
            milliseconds, microseconds, nanoseconds)
    return sum(np.asarray(v, dtype=t) for t, v in zip(types, vals)
               if v is not None)

In [437]: combine64(df['Y'], df['M'], df['D'])
Out[437]: array(['1990-01-06', '1991-02-07', '1992-03-08', '1993-04-09'], dtype='datetime64[D]')

I re-approached the problem and I think I found a solution. I read the csv file in the following way:

pandas_object = pd.read_csv('/Path/to/csv/file', parse_dates=True, index_col=[2, 0, 1])

where

index_col = [2,0,1]

gives the positions of the [year, month, day] columns.

The only problem now is that I have three new index columns: one for the year, another for the month, and another for the day.
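
One possible way to collapse those three index levels into a single DatetimeIndex is to turn them back into a small frame and hand it to to_datetime. This is a sketch, not part of the original answer: the inline CSV stands in for the real file, and the index name Datetimes is an assumption taken from the question's desired output.

```python
import io

import pandas as pd

# Stand-in for the CSV file; the column layout follows the question.
csv_text = """M,D,Y,Apples,Oranges
5,6,1990,12,3
5,7,1990,14,4
"""

df = pd.read_csv(io.StringIO(csv_text), index_col=[2, 0, 1])

# Turn the (Y, M, D) index levels back into columns with the names
# to_datetime expects, then replace the index with the assembled dates.
parts = df.index.to_frame(index=False)
parts.columns = ['year', 'month', 'day']
df.index = pd.DatetimeIndex(pd.to_datetime(parts), name='Datetimes')
print(df)
```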

Convert the dataframe to strings for easy string concatenation:

df=df.astype(str)

then convert to datetime, specifying the format:

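# zero-pad month and day so that the concatenated string is unambiguous
df.index = pd.to_datetime(df.Y + df.M.str.zfill(2) + df.D.str.zfill(2), format="%Y%m%d")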

which replaces the index rather than creating a new column.
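
String concatenation is only unambiguous when month and day have a fixed width. The sketch below, with made-up sample values, shows two different dates collapsing to the same unpadded string and how str.zfill keeps them apart:

```python
import pandas as pd

# Without padding, 15 January and 5 November produce the same string.
jan_15 = '1990' + '1' + '15'
nov_5 = '1990' + '11' + '5'
print(jan_15 == nov_5)

# Padding month and day to two digits keeps the two dates distinct.
df = pd.DataFrame({'Y': [1990, 1990], 'M': [1, 11], 'D': [15, 5]}).astype(str)
padded = df['Y'] + df['M'].str.zfill(2) + df['D'].str.zfill(2)
dates = pd.to_datetime(padded, format='%Y%m%d')
print(dates)
```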

An even better way is the following:

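import datetime

import pandas as pd

dataset = pd.read_csv('dataset.csv')

# Build one datetime.date per row; int() guards against float or NumPy values.
date = dataset.apply(lambda x: datetime.date(int(x['Yr']), int(x['Mo']), int(x['Dy'])), axis=1)
date = pd.to_datetime(date)

# Replace the three source columns with a single leading 'Date' column.
dataset = dataset.drop(columns=['Yr', 'Mo', 'Dy'])
dataset.insert(0, 'Date', date)

dataset.head()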

[pd.to_datetime(str(a) + str(b) + str(c), format='%m%d%Y')
 for a, b, c in zip(df.M, df.D, df.Y)]

Let's assume you've got a dictionary foo holding the date columns as parallel lists. If so, here's your one-liner:

>>> from datetime import datetime
>>> import pandas as pd
>>> foo = {"M": [1,2,3], "D":[30,28,21], "Y":[1980,1981,1982]}
>>>
>>> df = pd.DataFrame({"Datetime": [datetime(y,m,d) for y,m,d in zip(foo["Y"],foo["M"],foo["D"])]})

The real guts of it are this bit:

>>> [datetime(y,m,d) for y,m,d in zip(foo["Y"],foo["M"],foo["D"])]
[datetime.datetime(1980, 1, 30, 0, 0), datetime.datetime(1981, 2, 28, 0, 0), datetime.datetime(1982, 3, 21, 0, 0)]

This is the sort of thing zip was made for. It takes parallel lists and turns them into tuples. The list comprehension then unpacks each tuple (the for y,m,d in bit) and feeds the values into the datetime constructor.

pandas seems happy with the datetime objects.
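
To make the two steps visible separately, here is a small sketch with the intermediate tuples printed; the list names and sample values are illustrative only:

```python
from datetime import datetime

years = [1980, 1981]
months = [1, 2]
days = [30, 28]

# zip pairs up the i-th element of each list.
rows = list(zip(years, months, days))
print(rows)

# Each tuple is unpacked into y, m, d and passed to the constructor.
dates = [datetime(y, m, d) for y, m, d in rows]
print(dates)
```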
