Apply Function to Groups of Dask DataFrame
I have a huge CSV file which I initially converted into a Parquet file. This file contains information from different sensors.
| | Unnamed: 0 | sensor_id | timestamp | P1 | P2 |
|---:|-------------:|------------:|:--------------------|------:|-----:|
| 0 | 0 | 4224 | 2020-05-01T00:00:00 | 0.5 | 0.5 |
| 1 | 1 | 3016 | 2020-05-01T00:00:00 | 0.77 | 0.7 |
| 2 | 2 | 29570 | 2020-05-01T00:00:00 | 0.82 | 0.52 |
In order to process the data I want to create several smaller DataFrames (using resampling etc.) containing the time series of each sensor. These time series should then be inserted into an HDF5 file.
Is there any other, faster possibility besides looping over every group:
```python
import dask.dataframe as dd
import numpy as np

def parse(d):
    # ... parsing
    return d

# load data
data = dd.read_parquet(fp)

# get array of all sensor ids/groups
sensor_ids = np.unique(data['sensor_id'].values).compute()
groups = data.groupby('sensor_id')

res = []
for idx in sensor_ids:
    # pull one group into memory and parse it
    d = parse(groups.get_group(idx).compute())
    res.append(d)

# ... loop over res ... store ...
```
I was thinking about using `data.groupby('sensor_id').apply(....)`, but this results in a single DataFrame, while the solution above calls the `compute()` method in every iteration, which leads to a very high computation time. The data contains a total of approx. 200_000_000 rows. There are approx. 11_000 sensors/groups in total.
Can I implement writing the time series of every sensor to an HDF5 file inside a function and call `apply` on it?
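Something like the following is what I have in mind (a rough, untested sketch; `store_group`, the `sensors/` directory and the `timeseries` key are only placeholder names, and I assume one HDF5 file per sensor because writing to a single shared HDF5 file from parallel workers does not seem safe):

```python
import dask.dataframe as dd
import pandas as pd

def store_group(g: pd.DataFrame) -> pd.DataFrame:
    # g is the pandas DataFrame of one sensor; write it to its own HDF5 file
    # (the "sensors/" directory is assumed to exist already)
    sensor = g['sensor_id'].iloc[0]
    parse(g).to_hdf(f"sensors/{sensor}.h5", key="timeseries", mode="w")
    # return something small so the resulting Dask DataFrame stays tiny
    return pd.DataFrame({"sensor_id": [sensor], "rows": [len(g)]})

data = dd.read_parquet(fp)
summary = (
    data.groupby('sensor_id')
        .apply(store_group, meta={"sensor_id": "i8", "rows": "i8"})
        .compute()
)
```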
The desired result for one group/sensor looks like this:

```python
parse(data.groupby('sensor_id').get_group(4224).compute()).to_markdown()
```
| timestamp | sensor_id | P1 | P2 |
|:--------------------|------------:|--------:|--------:|
| 2020-05-01 00:00:00 | 4224 | 2.75623 | 1.08645 |
| 2020-05-02 00:00:00 | 4224 | 5.69782 | 3.21847 |
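For illustration, a simplified stand-in for what `parse` does could look roughly like this (an assumption on my side: a plain daily aggregation of the measurements of one sensor; the real parsing is more involved):

```python
import pandas as pd

def parse(d: pd.DataFrame) -> pd.DataFrame:
    # hypothetical stand-in: aggregate the raw measurements of one sensor per day
    d = d.copy()
    d['timestamp'] = pd.to_datetime(d['timestamp'])
    out = d.set_index('timestamp').resample('1D')[['P1', 'P2']].sum()
    out.insert(0, 'sensor_id', d['sensor_id'].iloc[0])
    return out
```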
Here looping is not the best way. If you are happy to save the small datasets as Parquet, you could just use the option `partition_on`:
```python
import dask.dataframe as dd

data = dd.read_parquet(fp)
data.to_parquet("data_partitioned", partition_on="sensor_id")
```
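After this, every sensor ends up in its own `sensor_id=...` subfolder, and you can load a single sensor cheaply because the filter is pushed down to the partition level. A small sketch of reading one sensor back (using `parse` from the question; depending on the engine the partition column may come back as a categorical):

```python
import dask.dataframe as dd

# only the matching "sensor_id=4224" directory is actually read
one_sensor = dd.read_parquet(
    "data_partitioned",
    filters=[("sensor_id", "==", 4224)],
)
result = parse(one_sensor.compute())
```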