dask dataframe: how to convert a column with to_datetime

Question

I am trying to convert one column of my dataframe to datetime. Following the discussion at https://github.com/dask/dask/issues/863 I tried the following code:

import dask.dataframe as dd
df['time'].map_partitions(pd.to_datetime, columns='time').compute()

But I get the following error message:

ValueError: Metadata inference failed, please provide `meta` keyword

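What exactly should I put under meta? Should I pass a dictionary of all the columns in df, or only of the 'time' column? And what type should I put? I have tried dtype and datetime64, but neither has worked so far.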

Thank you, I appreciate your guidance.

Update

I will include the new error messages here:

1) Using Timestamp

df['trd_exctn_dt'].map_partitions(pd.Timestamp).compute()

TypeError: Cannot convert input to Timestamp

2) Using to_datetime and meta

meta = ('time', pd.Timestamp)
df['time'].map_partitions(pd.to_datetime,meta=meta).compute()
TypeError: to_datetime() got an unexpected keyword argument 'meta'

3) Using only to_datetime: stuck at 2%

df['trd_exctn_dt'].map_partitions(pd.to_datetime).compute()
[                                        ] | 2% Completed |  2min 20.3s

In addition, I would like to be able to specify the format of the dates, as I would in pandas:

pd.to_datetime(df['time'], format='%m%d%Y')

Update 2

After updating to Dask 0.11, I no longer have problems with the meta keyword. Still, I can't get past 2% on a 2GB dataframe.

df['trd_exctn_dt'].map_partitions(pd.to_datetime, meta=meta).compute()
    [                                        ] | 2% Completed |  30min 45.7s

Update 3

This worked better:

def parse_dates(df):
  return pd.to_datetime(df['time'], format = '%m/%d/%Y')

df.map_partitions(parse_dates, meta=meta)

I'm not sure whether this is the right approach.

Answer 1

Use `astype`

You can use the astype method to convert the dtype of a series to a NumPy dtype

df.time.astype('M8[us]')

There is probably a way to specify a Pandas-style dtype as well (edits welcome)

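Use map_partitions and meta

When using black-box methods like map_partitions, dask.dataframe needs to know the type and names of the output. There are a few ways to do this, listed in the docstring for map_partitions.

You can supply an empty Pandas object with the right dtype and name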

meta = pd.Series([], name='time', dtype='datetime64[ns]')

Or you can provide a tuple of (name, dtype) for a Series, or a dict for a DataFrame

meta = ('time', 'datetime64[ns]')

Then everything should be fine

df.time.map_partitions(pd.to_datetime, meta=meta)

If you were to call map_partitions on df instead, you would need to provide the dtypes for everything. That isn't the case in your example, though.

Answer 2

Dask also comes with to_datetime, so this should work as well.

df['time']=dd.to_datetime(df.time,unit='ns')

The values that unit accepts are the same as for pd.to_datetime in pandas; they are listed in the pandas documentation.

Answer 3

I'm not sure whether this is the right way, but mapping the column worked for me:

df['time'] = df['time'].map(lambda x: pd.to_datetime(x, errors='coerce'))

Answer 4

This worked for me:

ddf["Date"] = ddf["Date"].map_partitions(pd.to_datetime, format='%d/%m/%Y', meta=('Date', 'datetime64[ns]'))

Answer 5

If the datetime is in a non-ISO format, map_partitions yields better results:

import dask
import pandas as pd
from dask.distributed import Client
client = Client()

ddf = dask.datasets.timeseries()
ddf = ddf.assign(datetime=ddf.index.astype(object))
ddf = (ddf.assign(datetime_nonISO = ddf['datetime'].astype(str).str.split(' ')
                                 .apply(lambda x: x[1]+' '+x[0], meta=('object'))) 

%%timeit
ddf.datetime = ddf.datetime.astype('M8[s]')
ddf.compute()

11.3 s ± 719 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

ddf = dask.datasets.timeseries()
ddf = ddf.assign(datetime=ddf.index.astype(object))
ddf = (ddf.assign(datetime_nonISO = ddf['datetime'].astype(str).str.split(' ')
                                 .apply(lambda x: x[1]+' '+x[0], meta=('object'))) 


%%timeit
ddf.datetime_nonISO = (ddf.datetime_nonISO.map_partitions(pd.to_datetime
                       ,  format='%H:%M:%S %Y-%m-%d', meta=('datetime64[s]')))
ddf.compute()

8.78 s ± 599 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

ddf = dask.datasets.timeseries()
ddf = ddf.assign(datetime=ddf.index.astype(object))
ddf = (ddf.assign(datetime_nonISO = ddf['datetime'].astype(str).str.split(' ')
                                 .apply(lambda x: x[1]+' '+x[0], meta=('object'))) 

%%timeit
ddf.datetime_nonISO = ddf.datetime_nonISO.astype('M8[s]')
ddf.compute()

1min 8s ± 3.65 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
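The %%timeit cells above only run inside IPython or Jupyter. As a rough sketch for a plain script, the explicit-format parse can be timed on a single pandas object with time.perf_counter. The data size and the names (stamps, non_iso, parsed, elapsed) are assumptions, and the printed time will vary by machine, so no expected figure is given.

```python
import time
import pandas as pd

# Hypothetical data: 100,000 timestamps rendered as "HH:MM:SS YYYY-MM-DD".
stamps = pd.date_range("2000-01-01", periods=100_000, freq="min")
non_iso = pd.Series(stamps.strftime("%H:%M:%S %Y-%m-%d"))

# Time only the parse with the explicit, time-first format.
start = time.perf_counter()
parsed = pd.to_datetime(non_iso, format="%H:%M:%S %Y-%m-%d")
elapsed = time.perf_counter() - start

print(f"explicit format: {elapsed:.3f} s for {len(parsed)} rows")
```

This measures one pandas call only; it does not reproduce the Dask scheduling overhead included in the timings above.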

Answer summary

5 answers

Answer 1: score 12, accepted, 2016-09-20 11:49:19
Answer 2: score 5, 2019-04-16 10:51:07
Answer 3: score 4, 2018-01-28 13:34:15
Answer 4: score 2, 2018-10-02 03:11:21
Answer 5: score 0, 2019-05-13 15:07:55
