Fastest way to slice a Dask dataframe based on date ranges from another dataframe
What would be the fastest approach to select, from the Dask dataframe dd2, only the dates that fall within the date ranges in df1? All dates outside the ranges should be dropped.
df1 - Pandas dataframe with start/end date ranges
start end
01 2018-06-25 2018-06-29
02 2019-05-06 2019-05-13
...
dd2 - Dask dataframe (30M rows)
Rows marked with (*) have to be selected
date value1
2018-01-01 23
2018-01-01 24
2018-01-02 545
2018-01-03 433
2018-01-04 23
*2018-06-25 234
*2018-06-25 50
*2018-06-25 120
*2018-06-26 22
*2018-06-27 32
*2018-06-27 123
*2018-06-28 603
*2018-06-29 625
2019-01-01 734
2019-01-01 241
2019-01-01 231
2019-01-02 211
2019-01-02 214
2019-05-05 234
2019-05-05 111
*2019-05-06 846
*2019-05-06 231
*2019-05-07 654
*2019-05-07 119
*2019-05-08 212
*2019-05-08 122
*2019-05-06 765
*2019-05-13 231
*2019-05-13 213
*2019-05-13 443
2019-05-14 321
2019-05-14 231
2019-05-15 123
...
Output: Dask dataframe with the selected slices appended
date value1
2018-06-25 234
2018-06-25 50
2018-06-25 120
2018-06-26 22
2018-06-27 32
2018-06-27 123
2018-06-28 603
2018-06-29 625
2019-05-06 846
2019-05-06 231
2019-05-07 654
2019-05-07 119
2019-05-08 212
2019-05-08 122
2019-05-06 765
2019-05-13 231
2019-05-13 213
2019-05-13 443
This code works, but I need to pass the start and end date ranges from df1 to filter dd2 without hardcoding the dates manually.
dd2 = dd2[
    (dd2['date'] >= '2018-06-25') & (dd2['date'] <= '2018-06-29') |
    (dd2['date'] >= '2019-05-06') & (dd2['date'] <= '2019-05-13')
]
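For reference, a minimal sketch (assuming df1 has start/end columns and dd2 has a date column, as shown above) that builds the same combined boolean mask programmatically from the rows of df1 instead of hardcoding the dates:
import functools
import operator

# OR together one (start <= date <= end) mask per row of df1.
masks = [
    (dd2["date"] >= row["start"]) & (dd2["date"] <= row["end"])
    for _, row in df1.iterrows()
]
dd2 = dd2[functools.reduce(operator.or_, masks)]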
This looks like it might work:
from itertools import starmap

import dask.dataframe as dd

# `df` holds the start/end ranges (df1 above); `ddf` is the Dask
# dataframe to filter (dd2 above).
date_ddf = ddf.set_index("date")
slices = starmap(slice, df.values)
# There might be a more "Pandas-esque" way to do this, but I
# don't know it yet.
sliced = map(date_ddf.__getitem__, slices)
# We have to reify the `map` object into a `list` for Dask.
concat_ddf = dd.concat(list(sliced))
concat_ddf.compute()
Each pass through the map on date_ddf.__getitem__ returns a cut of the original frame, hence the need for dd.concat to bring it back together.
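As an aside, the same idea can be written with .loc instead of __getitem__, which some may find more readable; a small sketch under the same assumptions (df holds the start/end ranges, date_ddf is indexed by date):
concat_ddf = dd.concat(
    [date_ddf.loc[start:end] for start, end in df.itertuples(index=False)]
)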
Here is another approach, but using a list comprehension to slice by the index, and with verification (at the end) that the slicing is done correctly.
Imports
from datetime import datetime
import dask.dataframe as dd
import numpy as np
import pandas as pd
from dask import compute
Specify adjustable inputs
# Start date from which to create dummy data to use
data_start_date = "1700-01-01"
# Frequency of dummy data created (hourly)
data_freq = "H"
# number of rows of data to generate
nrows = 3_000_000
# Dask DataFrame chunk size; will be used later to determine how many files
# (of the dummy data generated here) will be exported to disk
chunksize = 75_000
Generate df1 with the slicing boundary dates
df1 = pd.DataFrame.from_records(
    [
        {"start": datetime(1850, 1, 6, 0, 0, 0), "end": datetime(1870, 9, 4, 23, 0, 0)},
        {"start": datetime(1880, 7, 6, 0, 0, 0), "end": datetime(1895, 4, 9, 23, 0, 0)},
        {"start": datetime(1910, 11, 25, 0, 0, 0), "end": datetime(1915, 5, 5, 23, 0, 0)},
        {"start": datetime(1930, 10, 8, 0, 0, 0), "end": datetime(1940, 2, 8, 23, 0, 0)},
        {"start": datetime(1945, 9, 9, 0, 0, 0), "end": datetime(1950, 1, 3, 23, 0, 0)},
    ]
)
print(df1)
start end
0 1850-01-06 1870-09-04 23:00:00
1 1880-07-06 1895-04-09 23:00:00
2 1910-11-25 1915-05-05 23:00:00
3 1930-10-08 1940-02-08 23:00:00
4 1945-09-09 1950-01-03 23:00:00
Create dummy data
A column named wanted is assigned here, with all rows set to False.
df = pd.DataFrame(
    np.random.rand(nrows),
    index=pd.date_range(data_start_date, periods=nrows, freq=data_freq),
    columns=["value1"],
)
df.index.name = "date"
df["wanted"] = False
print(df.head())
value1 wanted
date
1700-01-01 00:00:00 0.504119 False
1700-01-01 01:00:00 0.582796 False
1700-01-01 02:00:00 0.383905 False
1700-01-01 03:00:00 0.995389 False
1700-01-01 04:00:00 0.592130 False
Now, we'll change the wanted rows to True if the rows have the same dates as those in df1. The wanted column is not necessary in your real use-case, but is only required to check our work.
for _, row in df1.iterrows():
    df.loc[row['start']: row['end'], "wanted"] = True
df = df.reset_index()
print(df.head())
print(df["wanted"].value_counts().to_frame())
date value1 wanted
0 1700-01-01 00:00:00 0.504119 False
1 1700-01-01 01:00:00 0.582796 False
2 1700-01-01 02:00:00 0.383905 False
3 1700-01-01 03:00:00 0.995389 False
4 1700-01-01 04:00:00 0.592130 False
wanted
False 2530800
True 469200
Note that calling .value_counts() on the wanted column shows the number of True values we should expect in this column if we've sliced our data correctly. This was done using data in a pandas.DataFrame, but later we'll do the same with this data in a dask.DataFrame.
Now, we'll export the data to multiple .parquet files locally
Since we want to start with data loaded into dask from .parquet files, we'll convert the pandas.DataFrame to a dask.DataFrame and then set the chunksize parameter, which will determine how many files are created (chunksize rows will be placed in each exported file).
ddf = dd.from_pandas(df, chunksize=chunksize)
ddf.to_parquet("data", engine="auto")
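With nrows = 3_000_000 and chunksize = 75_000 this gives 3_000_000 / 75_000 = 40 partitions, so 40 part files should land under data/. A quick sanity check (the part.*.parquet naming is an assumption; exact file names depend on the dask version):
import glob

print(ddf.npartitions)                        # expected: 40
print(len(glob.glob("data/part.*.parquet")))  # expected: 40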
Now load all the .parquet files directly into a single dask.DataFrame and set the date column as the index
Note that the index is specified only when creating the dask.DataFrame and is not changed after that.
ddf = dd.read_parquet(
    "data/",
    dtype={"value1": "float64"},
    index="date",
    parse_dates=["date"],
)
print(ddf)
Dask DataFrame Structure:
value1 wanted
npartitions=40
1700-01-01 00:00:00 float64 bool
1708-07-23 00:00:00 ... ...
... ... ...
2033-09-07 00:00:00 ... ...
2042-03-28 23:00:00 ... ...
Dask Name: read-parquet, 40 tasks
Now, we're ready to slice using the dates in df1. We'll do this with a list comprehension to iterate over each row in df1, use the row to slice the data (in the dask.DataFrame), and then call dd.concat (as @joebeeson did)
slices = dd.concat([ddf.loc[row['start']: row['end']] for _, row in df1.iterrows()])
Finally, compute on this list of delayed dask objects to get a single pandas.DataFrame, sliced to give the required dates
ddf_sliced_computed = compute(slices)[0].reset_index()
print(ddf_sliced_computed.head())
print(ddf_sliced_computed["wanted"].value_counts().to_frame())
date value1 wanted
0 1850-01-06 00:00:00 0.671781 True
1 1850-01-06 01:00:00 0.455022 True
2 1850-01-06 02:00:00 0.490212 True
3 1850-01-06 03:00:00 0.240171 True
4 1850-01-06 04:00:00 0.162088 True
wanted
True 469200
As you can see, we've sliced out rows with the correct number of True values in the wanted column. We can verify this explicitly using the pandas.DataFrame that we used earlier to generate the dummy data that was later written to disk
assert all(ddf_sliced_computed["wanted"] == True)
assert (
    df[df["wanted"] == True]
    .reset_index(drop=True)
    .equals(ddf_sliced_computed[ddf_sliced_computed["wanted"] == True])
)
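For the real use-case, the wanted bookkeeping column can be dropped and the whole approach condenses to the date-indexed read plus the concatenated .loc slices; a minimal sketch, assuming the same df1 and parquet layout as above:
ddf = dd.read_parquet("data/", index="date")
result = dd.concat(
    [ddf.loc[row["start"]: row["end"]] for _, row in df1.iterrows()]
).compute()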