Dask Concatenate a Series of Dataframes
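I have a Dask Series of Pandas DataFrames. I would like to use dask.dataframe.multi.concat to convert it into a Dask DataFrame. However, dask.dataframe.multi.concat always requires a list of DataFrames.
I could call compute on the Dask Series of Pandas DataFrames to get a Pandas Series of DataFrames, which I could then turn into a list. But I think it would be better not to call compute, and instead obtain the Dask DataFrame directly from the Dask Series of Pandas DataFrames.
What is the best way to do this? Here is the code that produces the series of DataFrames: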
import pandas as pd
import dask.dataframe as dd
import operator
import numpy as np
import math
import itertools

def apportion_pcts(pcts, total):
    """Apportion an integer by percentages
    Uses the largest remainder method
    """
    if (sum(pcts) != 100):
        raise ValueError('Percentages must add up to 100')
    proportions = [total * (pct / 100) for pct in pcts]
    apportions = [math.floor(p) for p in proportions]
    remainder = total - sum(apportions)
    remainders = [(i, p - math.floor(p)) for (i, p) in enumerate(proportions)]
    remainders.sort(key=operator.itemgetter(1), reverse=True)
    for (i, _) in itertools.cycle(remainders):
        if remainder == 0:
            break
        else:
            apportions[i] += 1
            remainder -= 1
    return apportions

# images_df = dd.read_csv('./tests/data/classification/images.csv')
images_df = pd.DataFrame({"image_id": [0,1,2,3,4,5], "image_class_id": [0,1,1,3,3,5]})
images_df = dd.from_pandas(images_df, npartitions=1)
output_ratio = [80, 20]

def partition_class(partition):
    size = len(partition)
    proportions = apportion_pcts(output_ratio, size)
    slices = []
    start = 0
    for proportion in proportions:
        s = slice(start, start + proportion)
        slices.append(partition.iloc[s, :])
        start = start + proportion
    slicess = pd.Series(slices)
    return slicess

partitioned_schema = dd.utils.make_meta(
    [(0, object), (1, object)], pd.Index([], name='image_class_id'))
partitioned_df = images_df.groupby('image_class_id')
partitioned_df = partitioned_df.apply(partition_class, meta=partitioned_schema)
In partitioned_df, we can use partitioned_df[0] or partitioned_df[1] to get a series of DataFrame objects.
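As a rough illustration of what partition_class does to each group, the sketch below runs the same largest-remainder apportioning on plain Python lists instead of DataFrames. split_by_ratio is a hypothetical helper introduced only for this example; it is not part of the code above.

```python
import itertools
import math
import operator


def apportion_pcts(pcts, total):
    """Apportion an integer by percentages (largest remainder method)."""
    if sum(pcts) != 100:
        raise ValueError('Percentages must add up to 100')
    proportions = [total * (pct / 100) for pct in pcts]
    apportions = [math.floor(p) for p in proportions]
    remainder = total - sum(apportions)
    # Hand out the leftover units to the largest fractional parts first.
    remainders = [(i, p - math.floor(p)) for (i, p) in enumerate(proportions)]
    remainders.sort(key=operator.itemgetter(1), reverse=True)
    for (i, _) in itertools.cycle(remainders):
        if remainder == 0:
            break
        apportions[i] += 1
        remainder -= 1
    return apportions


def split_by_ratio(items, pcts):
    """Hypothetical helper: slice a list the way partition_class slices a group."""
    pieces = []
    start = 0
    for size in apportion_pcts(pcts, len(items)):
        pieces.append(items[start:start + size])
        start += size
    return pieces


print(apportion_pcts([80, 20], 6))               # [5, 1]
print(apportion_pcts([80, 20], 2))               # [2, 0]
print(split_by_ratio(list(range(6)), [80, 20]))  # [[0, 1, 2, 3, 4], [5]]
```

With an 80/20 ratio, a group of six rows splits 5/1, while a group of one or two rows lands entirely in the first split. Every class in the six-row sample frame above has at most two rows, so the second split would be empty for each of them.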
Here is an example of the CSV file:
image_id,image_width,image_height,image_path,image_class_id
0,224,224,tmp/data/image_matrices/0.npy,5
1,224,224,tmp/data/image_matrices/1.npy,0
2,224,224,tmp/data/image_matrices/2.npy,4
3,224,224,tmp/data/image_matrices/3.npy,1
4,224,224,tmp/data/image_matrices/4.npy,9
5,224,224,tmp/data/image_matrices/5.npy,2
6,224,224,tmp/data/image_matrices/6.npy,1
7,224,224,tmp/data/image_matrices/7.npy,3
8,224,224,tmp/data/image_matrices/8.npy,1
9,224,224,tmp/data/image_matrices/9.npy,4
I tried to do a reduction afterwards, but it doesn't quite make sense because of the proxy 'foo' strings.
def zip_partitions(s):
    r = []
    for c in s.columns:
        l = s[c].tolist()
        r.append(pd.concat(l))
    return pd.Series(r)

output_df = partitioned_df.reduction(
    chunk=zip_partitions
)
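The list of proxies I'm trying to concatenate is ['foo', 'foo']. What is this phase for? Is it to discover how to perform the task? But then certain operations don't work. I wonder whether I'm getting these strings because I'm operating on objects.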
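The ['foo', 'foo'] values are most likely placeholder data. As far as I understand, when no meta is passed, Dask runs the function on a small fake frame to infer the output type, and object columns in that fake frame hold the string 'foo'. The sketch below reproduces the failure with pandas alone. The fake frame is built by hand as an assumption about what Dask generates; it is not produced by Dask itself.

```python
import pandas as pd


def zip_partitions(s):
    r = []
    for c in s.columns:
        l = s[c].tolist()
        r.append(pd.concat(l))
    return pd.Series(r)


# Hand-built stand-in (an assumption) for the placeholder frame used
# during metadata inference: object columns filled with 'foo'.
fake = pd.DataFrame({0: ['foo', 'foo'], 1: ['foo', 'foo']}, dtype=object)

try:
    zip_partitions(fake)
    error = None
except TypeError as exc:
    # pd.concat only accepts Series/DataFrame objects, not strings.
    error = exc

print(type(error).__name__)
```

If that is what is happening, passing an explicit meta to reduction should let Dask skip the inference step, so the function would only ever see real DataFrames.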
I figured out an answer by applying a reduction at the very end to "zip" each DataFrame up into a series of DataFrames.
import pandas as pd
import dask.dataframe as dd
import operator
import numpy as np
import math
import itertools

def apportion_pcts(pcts, total):
    """Apportion an integer by percentages
    Uses the largest remainder method
    """
    if (sum(pcts) != 100):
        raise ValueError('Percentages must add up to 100')
    proportions = [total * (pct / 100) for pct in pcts]
    apportions = [math.floor(p) for p in proportions]
    remainder = total - sum(apportions)
    remainders = [(i, p - math.floor(p)) for (i, p) in enumerate(proportions)]
    remainders.sort(key=operator.itemgetter(1), reverse=True)
    for (i, _) in itertools.cycle(remainders):
        if remainder == 0:
            break
        else:
            apportions[i] += 1
            remainder -= 1
    return apportions

images_df = dd.read_csv('./tests/data/classification/images.csv', blocksize=1024)
output_ratio = [80, 20]

def partition_class(group_df, ratio):
    proportions = apportion_pcts(ratio, len(group_df))
    partitions = []
    start = 0
    for proportion in proportions:
        s = slice(start, start + proportion)
        partitions.append(group_df.iloc[s, :])
        start += proportion
    return pd.Series(partitions)

partitioned_schema = dd.utils.make_meta(
    [(i, object) for i in range(len(output_ratio))],
    pd.Index([], name='image_class_id'))
partitioned_df = images_df.groupby('image_class_id')
partitioned_df = partitioned_df.apply(
    partition_class, meta=partitioned_schema, ratio=output_ratio)

def zip_partitions(partitions_df):
    partitions = []
    for i in partitions_df.columns:
        partitions.append(pd.concat(partitions_df[i].tolist()))
    return pd.Series(partitions)

zipped_schema = dd.utils.make_meta((None, object))
partitioned_ds = partitioned_df.reduction(
    chunk=zip_partitions, meta=zipped_schema)
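To make the "zip" step concrete, here is a minimal pandas-only sketch of what zip_partitions does to a frame whose cells are themselves DataFrames. The two small group frames and the manual 1/1 split are made up for illustration; in the code above they would come from the groupby and partition_class.

```python
import pandas as pd


def zip_partitions(partitions_df):
    partitions = []
    for i in partitions_df.columns:
        partitions.append(pd.concat(partitions_df[i].tolist()))
    return pd.Series(partitions)


# Illustrative groups (made-up data): two image classes with two rows each.
g1 = pd.DataFrame({"image_id": [1, 2], "image_class_id": [1, 1]})
g3 = pd.DataFrame({"image_id": [3, 4], "image_class_id": [3, 3]})

# One row per class, one column per split; every cell holds a DataFrame.
idx = pd.Index([1, 3], name="image_class_id")
cells = pd.DataFrame({
    0: pd.Series([g1.iloc[0:1], g3.iloc[0:1]], index=idx),
    1: pd.Series([g1.iloc[1:2], g3.iloc[1:2]], index=idx),
})

zipped = zip_partitions(cells)
first_split = zipped.iloc[0]   # rows from every class that fell into split 0
second_split = zipped.iloc[1]  # rows from every class that fell into split 1

print(first_split["image_id"].tolist())   # [1, 3]
print(second_split["image_id"].tolist())  # [2, 4]
```

Each column of cells is collapsed into one DataFrame, so the result has one entry per split rather than one per class.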
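I think it should be possible to combine the apply and the reduction into a single custom aggregation that represents a map-reduce operation.
However, I could not figure out how to do that with a custom aggregation, since it operates on a series groupby.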