
Dask Concatenate a Series of Dataframes

I have a Dask Series of Pandas DataFrames. I would like to use dask.dataframe.multi.concat to convert this into a Dask DataFrame. However, dask.dataframe.multi.concat always requires a list of DataFrames.

I could perform a compute on the Dask Series of Pandas DataFrames to get a Pandas Series of DataFrames, which I could then turn into a list. But I think it would be better not to call compute, and instead acquire the Dask DataFrame directly from the Dask Series of Pandas DataFrames.
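For reference, here is a minimal sketch of the compute-based route I would like to avoid (series_of_dfs stands in for my actual Dask Series, with the usual pandas/dask.dataframe imports); it materializes everything on the client first:

pdf_series = series_of_dfs.compute()          # Pandas Series of DataFrames
big_pdf = pd.concat(pdf_series.tolist())      # concatenate into one Pandas DataFrame
ddf = dd.from_pandas(big_pdf, npartitions=1)  # back to a Dask DataFrame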

What would be the best way to do this? Here's the code that produces the series of dataframes:

import pandas as pd
import dask.dataframe as dd
import operator
import numpy as np
import math
import itertools

def apportion_pcts(pcts, total):
    """Apportion an integer by percentages
    Uses the largest remainder method
    """
    if sum(pcts) != 100:
        raise ValueError('Percentages must add up to 100')
    proportions = [total * (pct / 100) for pct in pcts]
    apportions = [math.floor(p) for p in proportions]
    remainder = total - sum(apportions)
    remainders = [(i, p - math.floor(p)) for (i, p) in enumerate(proportions)]
    remainders.sort(key=operator.itemgetter(1), reverse=True)
    for (i, _) in itertools.cycle(remainders):
        if remainder == 0:
            break
        else:
            apportions[i] += 1
            remainder -= 1
    return apportions
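# Quick sanity check (illustrative, not part of the original code):
# splitting 11 rows 80/20 gives proportions [8.8, 2.2] -> floors [8, 2];
# the one leftover row goes to the slice with the largest fractional
# remainder (0.8), so:
# apportion_pcts([80, 20], 11) == [9, 2]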


# images_df = dd.read_csv('./tests/data/classification/images.csv')
images_df = pd.DataFrame({"image_id": [0,1,2,3,4,5], "image_class_id": [0,1,1,3,3,5]})
images_df = dd.from_pandas(images_df, npartitions=1)

output_ratio = [80, 20]

def partition_class(partition):
    """Split one class group into consecutive slices sized by output_ratio."""
    size = len(partition)
    proportions = apportion_pcts(output_ratio, size)
    slices = []
    start = 0
    for proportion in proportions:
        s = slice(start, start + proportion)
        slices.append(partition.iloc[s, :])
        start += proportion
    return pd.Series(slices)

partitioned_schema = dd.utils.make_meta(
    [(0, object), (1, object)], pd.Index([], name='image_class_id'))
partitioned_df = images_df.groupby('image_class_id')
partitioned_df = partitioned_df.apply(partition_class, meta=partitioned_schema)

In partitioned_df, we can access partitioned_df[0] or partitioned_df[1] to get a series of dataframe objects.
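To make that structure concrete, here is a hypothetical inspection (names are illustrative); computing one column yields a Pandas Series whose values are DataFrames, one per image_class_id group:

slices_80 = partitioned_df[0]   # Dask Series holding the 80% slices
print(slices_80.compute())      # Pandas Series of Pandas DataFrames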


Here is an example of the CSV file:

image_id,image_width,image_height,image_path,image_class_id
0,224,224,tmp/data/image_matrices/0.npy,5
1,224,224,tmp/data/image_matrices/1.npy,0
2,224,224,tmp/data/image_matrices/2.npy,4
3,224,224,tmp/data/image_matrices/3.npy,1
4,224,224,tmp/data/image_matrices/4.npy,9
5,224,224,tmp/data/image_matrices/5.npy,2
6,224,224,tmp/data/image_matrices/6.npy,1
7,224,224,tmp/data/image_matrices/7.npy,3
8,224,224,tmp/data/image_matrices/8.npy,1
9,224,224,tmp/data/image_matrices/9.npy,4

I tried to do a reduction afterwards, but this doesn't quite work due to a proxy 'foo' string.

def zip_partitions(s):
    r = []
    for c in s.columns:
        l = s[c].tolist()
        r.append(pd.concat(l))
    return pd.Series(r)

output_df = partitioned_df.reduction(
    chunk=zip_partitions
)

The proxy list that I'm attempting to concat is ['foo', 'foo']. What is this phase for? Is it to discover how to perform the task? Certain operations don't work on it. I'm wondering whether I'm getting these strings because I'm operating over object columns.
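As far as I can tell, the 'foo' values come from dask's metadata inference: before building the real graph, dask runs the function on a tiny dummy version of the data, and object-dtype columns in that dummy are filled with the placeholder string 'foo'. A minimal demonstration, assuming dask's internal meta helpers (their exact location can vary between versions):

from dask.dataframe.utils import make_meta, meta_nonempty

m = make_meta([('a', object)])
print(meta_nonempty(m)['a'].tolist())  # something like ['foo', 'foo']

So the strings appear precisely because the columns are object dtype: during this inference phase the reduction sees the dummy placeholders, not the real DataFrames.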

I figured out an answer by applying the reduction at the very end to "zip" up each column of dataframes into a series of concatenated dataframes.

import pandas as pd
import dask.dataframe as dd
import operator
import numpy as np
import math
import itertools


def apportion_pcts(pcts, total):
    """Apportion an integer by percentages
    Uses the largest remainder method
    """
    if sum(pcts) != 100:
        raise ValueError('Percentages must add up to 100')
    proportions = [total * (pct / 100) for pct in pcts]
    apportions = [math.floor(p) for p in proportions]
    remainder = total - sum(apportions)
    remainders = [(i, p - math.floor(p)) for (i, p) in enumerate(proportions)]
    remainders.sort(key=operator.itemgetter(1), reverse=True)
    for (i, _) in itertools.cycle(remainders):
        if remainder == 0:
            break
        else:
            apportions[i] += 1
            remainder -= 1
    return apportions


images_df = dd.read_csv('./tests/data/classification/images.csv', blocksize=1024)

output_ratio = [80, 20]


def partition_class(group_df, ratio):
    """Split one class group into consecutive slices sized by ratio."""
    proportions = apportion_pcts(ratio, len(group_df))
    partitions = []
    start = 0
    for proportion in proportions:
        s = slice(start, start + proportion)
        partitions.append(group_df.iloc[s, :])
        start += proportion
    return pd.Series(partitions)


partitioned_schema = dd.utils.make_meta(
    [(i, object) for i in range(len(output_ratio))],
    pd.Index([], name='image_class_id'))

partitioned_df = images_df.groupby('image_class_id')
partitioned_df = partitioned_df.apply(
    partition_class, meta=partitioned_schema, ratio=output_ratio)


def zip_partitions(partitions_df):
    """Concatenate each column of per-class slices into one DataFrame."""
    partitions = []
    for i in partitions_df.columns:
        partitions.append(pd.concat(partitions_df[i].tolist()))
    return pd.Series(partitions)


zipped_schema = dd.utils.make_meta((None, object))

partitioned_ds = partitioned_df.reduction(
    chunk=zip_partitions, meta=zipped_schema)
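If this computes as intended, materializing the result gives a plain Pandas Series with one concatenated DataFrame per ratio entry (a usage sketch):

splits = partitioned_ds.compute()
train_df, test_df = splits[0], splits[1]  # the 80% and 20% splits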

I think it should be possible to combine the apply and the reduction into a single custom aggregation representing a map-reduce operation.

However, I could not figure out how to do this with a custom aggregation, since custom aggregations operate on a series groupby; see the sketch below.
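For context, each hook of dd.Aggregation receives per-column SeriesGroupBy objects, which is what makes a whole-frame row split awkward. A sketch along the lines of the custom-mean example from the dask docs shows the shape of the API:

custom_mean = dd.Aggregation(
    'custom_mean',
    chunk=lambda s: (s.count(), s.sum()),                 # per-partition groupby state
    agg=lambda count, total: (count.sum(), total.sum()),  # combine partitions
    finalize=lambda count, total: total / count,          # produce the final column
)
# usage (illustrative): images_df.groupby('image_class_id')['image_id'].agg(custom_mean)

Everything here operates on one column's groups at a time, so there is no natural place to slice whole DataFrames by row.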

[Visualization of the task graph]
