Fastest way to slice a Dask dataframe based on date ranges from another dataframe
What would be the fastest approach to select, from the Dask dataframe dd2, only the dates that fall within the date ranges in df1? All dates outside the ranges should be dropped.
df1 - Pandas dataframe with start/end date ranges
start end
01 2018-06-25 2018-06-29
02 2019-05-06 2019-05-13
...
dd2 - Dask dataframe (30M rows)
Rows marked with (*) have to be selected
date value1
2018-01-01 23
2018-01-01 24
2018-01-02 545
2018-01-03 433
2018-01-04 23
*2018-06-25 234
*2018-06-25 50
*2018-06-25 120
*2018-06-26 22
*2018-06-27 32
*2018-06-27 123
*2018-06-28 603
*2018-06-29 625
2019-01-01 734
2019-01-01 241
2019-01-01 231
2019-01-02 211
2019-01-02 214
2019-05-05 234
2019-05-05 111
*2019-05-06 846
*2019-05-06 231
*2019-05-07 654
*2019-05-07 119
*2019-05-08 212
*2019-05-08 122
*2019-05-06 765
*2019-05-13 231
*2019-05-13 213
*2019-05-13 443
2019-05-14 321
2019-05-14 231
2019-05-15 123
...
Output: Dask dataframe with the selected slices appended
date value1
2018-06-25 234
2018-06-25 50
2018-06-25 120
2018-06-26 22
2018-06-27 32
2018-06-27 123
2018-06-28 603
2018-06-29 625
2019-05-06 846
2019-05-06 231
2019-05-07 654
2019-05-07 119
2019-05-08 212
2019-05-08 122
2019-05-06 765
2019-05-13 231
2019-05-13 213
2019-05-13 443
This code works, but I need to pass the start and end date ranges from df1 to filter dd2 without hardcoding the dates manually.
dd2 = dd2[
    (dd2['date'] >= '2018-06-25') & (dd2['date'] <= '2018-06-29') |
    (dd2['date'] >= '2019-05-06') & (dd2['date'] <= '2019-05-13')
]
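For reference, a minimal sketch (assuming df1 has start/end columns and dd2 has a date column, as shown above) that builds the same combined boolean mask programmatically from the rows of df1 instead of hardcoding the dates:
import functools
import operator

# OR together one (start <= date <= end) mask per row of df1.
masks = [
    (dd2["date"] >= row["start"]) & (dd2["date"] <= row["end"])
    for _, row in df1.iterrows()
]
dd2 = dd2[functools.reduce(operator.or_, masks)]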
This looks like it might work:
from itertools import starmap

import dask.dataframe as dd

# `df` holds the start/end ranges (df1 above); `ddf` is the Dask
# dataframe to filter (dd2 above).
date_ddf = ddf.set_index("date")
slices = starmap(slice, df.values)
# There might be a more "Pandas-esque" way to do this, but I
# don't know it yet.
sliced = map(date_ddf.__getitem__, slices)
# We have to reify the `map` object into a `list` for Dask.
concat_ddf = dd.concat(list(sliced))
concat_ddf.compute()
Each pass through the map on date_ddf.__getitem__ returns a cut of the original frame, hence the need for dd.concat to bring it back together.
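As an aside, the same idea can be written with .loc instead of __getitem__, which some may find more readable; a small sketch under the same assumptions (df holds the start/end ranges, date_ddf is indexed by date):
concat_ddf = dd.concat(
    [date_ddf.loc[start:end] for start, end in df.itertuples(index=False)]
)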
Here is another approach, but using a list comprehension to slice by the index, and with verification (at the end) that the slicing is done correctly.
Imports
from datetime import datetime
import dask.dataframe as dd
import numpy as np
import pandas as pd
from dask import compute
Specify adjustable inputs
# Start date from which to create dummy data to use
data_start_date = "1700-01-01"
# Frequency of dummy data created (hourly)
data_freq = "H"
# number of rows of data to generate
nrows = 3_000_000
# Dask DataFrame chunk size; will be used later to determine how many files
# (of the dummy data generated here) will be exported to disk
chunksize = 75_000
Generate df1 with the slicing boundary dates
df1 = pd.DataFrame.from_records(
    [
        {"start": datetime(1850, 1, 6, 0, 0, 0), "end": datetime(1870, 9, 4, 23, 0, 0)},
        {"start": datetime(1880, 7, 6, 0, 0, 0), "end": datetime(1895, 4, 9, 23, 0, 0)},
        {"start": datetime(1910, 11, 25, 0, 0, 0), "end": datetime(1915, 5, 5, 23, 0, 0)},
        {"start": datetime(1930, 10, 8, 0, 0, 0), "end": datetime(1940, 2, 8, 23, 0, 0)},
        {"start": datetime(1945, 9, 9, 0, 0, 0), "end": datetime(1950, 1, 3, 23, 0, 0)},
    ]
)
print(df1)
start end
0 1850-01-06 1870-09-04 23:00:00
1 1880-07-06 1895-04-09 23:00:00
2 1910-11-25 1915-05-05 23:00:00
3 1930-10-08 1940-02-08 23:00:00
4 1945-09-09 1950-01-03 23:00:00
Create dummy data
A column named wanted is assigned here, with all rows set to False.
df = pd.DataFrame(
    np.random.rand(nrows),
    index=pd.date_range(data_start_date, periods=nrows, freq=data_freq),
    columns=["value1"],
)
df.index.name = "date"
df["wanted"] = False
print(df.head())
value1 wanted
date
1700-01-01 00:00:00 0.504119 False
1700-01-01 01:00:00 0.582796 False
1700-01-01 02:00:00 0.383905 False
1700-01-01 03:00:00 0.995389 False
1700-01-01 04:00:00 0.592130 False
Now, we'll change the wanted rows to True if the rows have the same dates as those in df1. The wanted column is not necessary in your real use-case, but is only required to check our work.
for _, row in df1.iterrows():
    df.loc[row['start']: row['end'], "wanted"] = True
df = df.reset_index()
print(df.head())
print(df["wanted"].value_counts().to_frame())
date value1 wanted
0 1700-01-01 00:00:00 0.504119 False
1 1700-01-01 01:00:00 0.582796 False
2 1700-01-01 02:00:00 0.383905 False
3 1700-01-01 03:00:00 0.995389 False
4 1700-01-01 04:00:00 0.592130 False
wanted
False 2530800
True 469200
Note that calling .value_counts() on the wanted column shows the number of True values we should expect in this column if we've sliced our data correctly. This was done using data in a pandas.DataFrame, but later we'll do the same with this data in a dask.DataFrame.
Now, we'll export the data to multiple .parquet files locally
Since we want to start with data loaded into dask from .parquet files, we'll convert the pandas.DataFrame to a dask.DataFrame and then set the chunksize parameter, which will determine how many files are created (chunksize rows will be placed in each exported file).
ddf = dd.from_pandas(df, chunksize=chunksize)
ddf.to_parquet("data", engine="auto")
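With nrows = 3_000_000 and chunksize = 75_000 this gives 3_000_000 / 75_000 = 40 partitions, so 40 part files should land under data/. A quick sanity check (the part.*.parquet naming is an assumption; exact file names depend on the dask version):
import glob

print(ddf.npartitions)                        # expected: 40
print(len(glob.glob("data/part.*.parquet")))  # expected: 40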
Now load all the .parquet files directly into a single dask.DataFrame and set the date column as the index
Note that the index is specified only when creating the dask.DataFrame and is not changed after that.
ddf = dd.read_parquet(
    "data/",
    dtype={"value1": "float64"},
    index="date",
    parse_dates=["date"],
)
print(ddf)
Dask DataFrame Structure:
value1 wanted
npartitions=40
1700-01-01 00:00:00 float64 bool
1708-07-23 00:00:00 ... ...
... ... ...
2033-09-07 00:00:00 ... ...
2042-03-28 23:00:00 ... ...
Dask Name: read-parquet, 40 tasks
Now, we're ready to slice using the dates in df1. We'll do this with a list comprehension to iterate over each row in df1, use the row to slice the data (in the dask.DataFrame), and then call dd.concat (as @joebeeson did)
slices = dd.concat([ddf.loc[row['start']: row['end']] for _, row in df1.iterrows()])
Finally, compute on this list of delayed dask objects to get a single pandas.DataFrame, sliced to give the required dates
ddf_sliced_computed = compute(slices)[0].reset_index()
print(ddf_sliced_computed.head())
print(ddf_sliced_computed["wanted"].value_counts().to_frame())
date value1 wanted
0 1850-01-06 00:00:00 0.671781 True
1 1850-01-06 01:00:00 0.455022 True
2 1850-01-06 02:00:00 0.490212 True
3 1850-01-06 03:00:00 0.240171 True
4 1850-01-06 04:00:00 0.162088 True
wanted
True 469200
As you can see, we've sliced out rows with the correct number of True values in the wanted column. We can verify this explicitly using the pandas.DataFrame that we used earlier to generate the dummy data that was later written to disk
assert all(ddf_sliced_computed["wanted"] == True)
assert (
    df[df["wanted"] == True]
    .reset_index(drop=True)
    .equals(ddf_sliced_computed[ddf_sliced_computed["wanted"] == True])
)
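For the real use-case, the wanted bookkeeping column can be dropped and the whole approach condenses to the date-indexed read plus the concatenated .loc slices; a minimal sketch, assuming the same df1 and parquet layout as above:
ddf = dd.read_parquet("data/", index="date")
result = dd.concat(
    [ddf.loc[row["start"]: row["end"]] for _, row in df1.iterrows()]
).compute()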