Dask: Concatenate a Series of DataFrames
I have a Dask Series of Pandas DataFrames. I would like to use dask.dataframe.multi.concat to convert this into a Dask DataFrame. However, dask.dataframe.multi.concat always requires a list of DataFrames.
I could perform a compute on the Dask Series of Pandas DataFrames to get a Pandas Series of DataFrames, at which point I could turn that into a list. But I think it would be better not to call compute, and instead acquire the Dask DataFrame directly from the Dask Series of Pandas DataFrames.
What would be the best way to do this? Here's the code that produces the series of dataframes:
import pandas as pd
import dask.dataframe as dd
import operator
import numpy as np
import math
import itertools

def apportion_pcts(pcts, total):
    """Apportion an integer by percentages.

    Uses the largest remainder method.
    """
    if sum(pcts) != 100:
        raise ValueError('Percentages must add up to 100')
    proportions = [total * (pct / 100) for pct in pcts]
    apportions = [math.floor(p) for p in proportions]
    remainder = total - sum(apportions)
    remainders = [(i, p - math.floor(p)) for (i, p) in enumerate(proportions)]
    remainders.sort(key=operator.itemgetter(1), reverse=True)
    for (i, _) in itertools.cycle(remainders):
        if remainder == 0:
            break
        apportions[i] += 1
        remainder -= 1
    return apportions

# images_df = dd.read_csv('./tests/data/classification/images.csv')
images_df = pd.DataFrame({"image_id": [0, 1, 2, 3, 4, 5],
                          "image_class_id": [0, 1, 1, 3, 3, 5]})
images_df = dd.from_pandas(images_df, npartitions=1)

output_ratio = [80, 20]

def partition_class(partition):
    size = len(partition)
    proportions = apportion_pcts(output_ratio, size)
    slices = []
    start = 0
    for proportion in proportions:
        s = slice(start, start + proportion)
        slices.append(partition.iloc[s, :])
        start = start + proportion
    return pd.Series(slices)

partitioned_schema = dd.utils.make_meta(
    [(0, object), (1, object)], pd.Index([], name='image_class_id'))
partitioned_df = images_df.groupby('image_class_id')
partitioned_df = partitioned_df.apply(partition_class, meta=partitioned_schema)
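As a sanity check on the largest-remainder logic, the helper can be exercised on its own (repeated here so the snippet runs standalone):

```python
import math
import operator
import itertools

def apportion_pcts(pcts, total):
    """Apportion an integer by percentages (largest remainder method)."""
    if sum(pcts) != 100:
        raise ValueError('Percentages must add up to 100')
    proportions = [total * (pct / 100) for pct in pcts]
    apportions = [math.floor(p) for p in proportions]
    remainder = total - sum(apportions)
    remainders = [(i, p - math.floor(p)) for (i, p) in enumerate(proportions)]
    remainders.sort(key=operator.itemgetter(1), reverse=True)
    for (i, _) in itertools.cycle(remainders):
        if remainder == 0:
            break
        apportions[i] += 1
        remainder -= 1
    return apportions

print(apportion_pcts([80, 20], 10))  # -> [8, 2]
print(apportion_pcts([80, 20], 7))   # -> [6, 1], the leftover row goes to the larger remainder
```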
In partitioned_df, we can index partitioned_df[0] or partitioned_df[1] to get a series of dataframe objects.
Here is an example of the CSV file:
image_id,image_width,image_height,image_path,image_class_id
0,224,224,tmp/data/image_matrices/0.npy,5
1,224,224,tmp/data/image_matrices/1.npy,0
2,224,224,tmp/data/image_matrices/2.npy,4
3,224,224,tmp/data/image_matrices/3.npy,1
4,224,224,tmp/data/image_matrices/4.npy,9
5,224,224,tmp/data/image_matrices/5.npy,2
6,224,224,tmp/data/image_matrices/6.npy,1
7,224,224,tmp/data/image_matrices/7.npy,3
8,224,224,tmp/data/image_matrices/8.npy,1
9,224,224,tmp/data/image_matrices/9.npy,4
I tried to do a reduction afterwards, but this doesn't quite make sense due to a proxy foo string.
def zip_partitions(s):
    r = []
    for c in s.columns:
        l = s[c].tolist()
        r.append(pd.concat(l))
    return pd.Series(r)

output_df = partitioned_df.reduction(
    chunk=zip_partitions
)
The proxy list that I'm attempting to concat is ['foo', 'foo']. What is this phase for? Is it there to discover how to perform the task? But then certain operations don't work. I'm wondering whether I'm getting these strings because I'm operating over object columns.
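As far as I understand, the 'foo' strings come from dask's metadata inference: before executing the real graph, dask runs the chunk function on a tiny non-empty stand-in frame, and object-dtype cells in that stand-in are filled with the placeholder string 'foo'. Since zip_partitions then calls pd.concat on a list of plain strings, the metadata pass fails. A pandas-only illustration of that failure:

```python
import pandas as pd

# pd.concat refuses plain strings -- the same error the metadata pass hits
# when dask substitutes 'foo' placeholders for object-dtype cells
try:
    pd.concat(['foo', 'foo'])
except TypeError as e:
    print(type(e).__name__)  # -> TypeError
```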
I figured out an answer by applying the reduction at the very end to "zip" up each dataframe into a series of dataframes.
import pandas as pd
import dask.dataframe as dd
import operator
import numpy as np
import math
import itertools

def apportion_pcts(pcts, total):
    """Apportion an integer by percentages.

    Uses the largest remainder method.
    """
    if sum(pcts) != 100:
        raise ValueError('Percentages must add up to 100')
    proportions = [total * (pct / 100) for pct in pcts]
    apportions = [math.floor(p) for p in proportions]
    remainder = total - sum(apportions)
    remainders = [(i, p - math.floor(p)) for (i, p) in enumerate(proportions)]
    remainders.sort(key=operator.itemgetter(1), reverse=True)
    for (i, _) in itertools.cycle(remainders):
        if remainder == 0:
            break
        apportions[i] += 1
        remainder -= 1
    return apportions

images_df = dd.read_csv('./tests/data/classification/images.csv', blocksize=1024)

output_ratio = [80, 20]

def partition_class(group_df, ratio):
    proportions = apportion_pcts(ratio, len(group_df))
    partitions = []
    start = 0
    for proportion in proportions:
        s = slice(start, start + proportion)
        partitions.append(group_df.iloc[s, :])
        start += proportion
    return pd.Series(partitions)

partitioned_schema = dd.utils.make_meta(
    [(i, object) for i in range(len(output_ratio))],
    pd.Index([], name='image_class_id'))
partitioned_df = images_df.groupby('image_class_id')
partitioned_df = partitioned_df.apply(
    partition_class, meta=partitioned_schema, ratio=output_ratio)

def zip_partitions(partitions_df):
    partitions = []
    for i in partitions_df.columns:
        partitions.append(pd.concat(partitions_df[i].tolist()))
    return pd.Series(partitions)

zipped_schema = dd.utils.make_meta((None, object))

partitioned_ds = partitioned_df.reduction(
    chunk=zip_partitions, meta=zipped_schema)
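To see what the chunk function does in isolation, here is a pandas-only sketch of zip_partitions with made-up groups (g1 and g2 are hypothetical stand-ins for the per-class slices):

```python
import pandas as pd

def zip_partitions(partitions_df):
    # column i holds split i of every group; concatenate each column into one frame
    return pd.Series([pd.concat(partitions_df[c].tolist())
                      for c in partitions_df.columns])

# Two made-up groups, each already split 80/20
g1 = pd.DataFrame({'x': [1, 2, 3, 4, 5]})
g2 = pd.DataFrame({'x': [6, 7, 8, 9, 10]})
parts = pd.DataFrame({0: pd.Series([g1.iloc[:4], g2.iloc[:4]]),
                      1: pd.Series([g1.iloc[4:], g2.iloc[4:]])})

zipped = zip_partitions(parts)
print(zipped[0]['x'].tolist())  # -> [1, 2, 3, 4, 6, 7, 8, 9]
print(zipped[1]['x'].tolist())  # -> [5, 10]
```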
I think it should be possible to combine both the reduction and the apply into a single custom aggregation representing a map-reduce operation.
However, I could not figure out how to do this with dask's custom aggregation, since it operates on a series groupby.