
Dask Concatenate a Series of Dataframes

I have a Dask Series of Pandas DataFrames. I would like to use dask.dataframe.multi.concat to convert this into a Dask DataFrame. However, dask.dataframe.multi.concat requires a list of DataFrames.

I could perform a compute on the Dask series of Pandas DataFrames to get a Pandas series of DataFrames, at which point I could turn that into a list. But I think it would be better not to call compute and instead directly acquire the Dask DataFrame from the Dask Series of Pandas DataFrames.
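
For concreteness, the compute-based route I'd rather avoid would look roughly like this (a sketch using the partitioned_df built below; pd.concat plus dd.from_pandas with npartitions=1 are just illustrative choices):

# Sketch of the compute-based route: materialize the Dask Series of
# DataFrames, concatenate in pandas, then wrap the result back in Dask.
pandas_series = partitioned_df[0].compute()  # pandas Series of DataFrames
ddf = dd.from_pandas(pd.concat(pandas_series.tolist()), npartitions=1)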

What would be the best way to do this? Here's my code that produces the series of DataFrames:

import pandas as pd
import dask.dataframe as dd
import operator
import numpy as np
import math
import itertools

def apportion_pcts(pcts, total):
    """Apportion an integer by percentages
    Uses the largest remainder method
    """
    if sum(pcts) != 100:
        raise ValueError('Percentages must add up to 100')
    proportions = [total * (pct / 100) for pct in pcts]
    apportions = [math.floor(p) for p in proportions]
    remainder = total - sum(apportions)
    remainders = [(i, p - math.floor(p)) for (i, p) in enumerate(proportions)]
    remainders.sort(key=operator.itemgetter(1), reverse=True)
    for (i, _) in itertools.cycle(remainders):
        if remainder == 0:
            break
        else:
            apportions[i] += 1
            remainder -= 1
    return apportions
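
# For example (hypothetical numbers): apportion_pcts([80, 20], 7) computes
# proportions [5.6, 1.4], floors them to [5, 1], then hands the single
# leftover unit to the slice with the largest fractional remainder,
# giving [6, 1].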


# images_df = dd.read_csv('./tests/data/classification/images.csv')
images_df = pd.DataFrame({"image_id": [0,1,2,3,4,5], "image_class_id": [0,1,1,3,3,5]})
images_df = dd.from_pandas(images_df, npartitions=1)

output_ratio = [80, 20]

def partition_class(partition):
    size = len(partition)
    proportions = apportion_pcts(output_ratio, size)
    slices = []
    start = 0
    for proportion in proportions:
        s = slice(start, start + proportion)
        slices.append(partition.iloc[s, :])
        start = start+proportion
    return pd.Series(slices)

partitioned_schema = dd.utils.make_meta(
    [(0, object), (1, object)], pd.Index([], name='image_class_id'))
partitioned_df = images_df.groupby('image_class_id')
partitioned_df = partitioned_df.apply(partition_class, meta=partitioned_schema)

From partitioned_df, we can select partitioned_df[0] or partitioned_df[1] to get a Dask Series of DataFrame objects.
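
For example, one column can be inspected like this (compute is used here purely for illustration; first_split is a hypothetical name):

# Column 0 is a Dask Series whose values are pandas DataFrames,
# one per image_class_id group.
first_split = partitioned_df[0]
print(first_split.compute().iloc[0])  # the "80%" slice of the first group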


Here is an example of the CSV file:

image_id,image_width,image_height,image_path,image_class_id
0,224,224,tmp/data/image_matrices/0.npy,5
1,224,224,tmp/data/image_matrices/1.npy,0
2,224,224,tmp/data/image_matrices/2.npy,4
3,224,224,tmp/data/image_matrices/3.npy,1
4,224,224,tmp/data/image_matrices/4.npy,9
5,224,224,tmp/data/image_matrices/5.npy,2
6,224,224,tmp/data/image_matrices/6.npy,1
7,224,224,tmp/data/image_matrices/7.npy,3
8,224,224,tmp/data/image_matrices/8.npy,1
9,224,224,tmp/data/image_matrices/9.npy,4

I tried to do a reduction afterwards, but it doesn't quite work because of proxy 'foo' strings.

def zip_partitions(s):
    r = []
    for c in s.columns:
        r.append(pd.concat(s[c].tolist()))
    return pd.Series(r)

output_df = partitioned_df.reduction(
    chunk=zip_partitions
)

The proxy list that it attempts to concat is ['foo', 'foo']. What is this phase for? Is it there to discover the output schema? Certain operations don't work on these placeholders, and I suspect I'm getting the strings because the columns I'm operating over have object dtype.
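
(A likely explanation, based on how Dask's metadata inference generally works: when no meta= is supplied, Dask dry-runs the function on a tiny dummy frame, and its placeholder for object-dtype values is the string 'foo', hence the ['foo', 'foo'] list. Supplying meta explicitly skips that dry run, which is what the working version below does:)

# Passing meta describes the output (an unnamed object-dtype Series)
# so Dask does not need to infer it by running the chunk on dummy data.
zipped_schema = dd.utils.make_meta((None, object))
output_df = partitioned_df.reduction(chunk=zip_partitions, meta=zipped_schema)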

I figured out an answer by applying a reduction at the very end to "zip" up each column of DataFrames into a Series of concatenated DataFrames.

import pandas as pd
import dask.dataframe as dd
import operator
import numpy as np
import math
import itertools


def apportion_pcts(pcts, total):
    """Apportion an integer by percentages
    Uses the largest remainder method
    """
    if sum(pcts) != 100:
        raise ValueError('Percentages must add up to 100')
    proportions = [total * (pct / 100) for pct in pcts]
    apportions = [math.floor(p) for p in proportions]
    remainder = total - sum(apportions)
    remainders = [(i, p - math.floor(p)) for (i, p) in enumerate(proportions)]
    remainders.sort(key=operator.itemgetter(1), reverse=True)
    for (i, _) in itertools.cycle(remainders):
        if remainder == 0:
            break
        else:
            apportions[i] += 1
            remainder -= 1
    return apportions


images_df = dd.read_csv('./tests/data/classification/images.csv', blocksize=1024)

output_ratio = [80, 20]


def partition_class(group_df, ratio):
    # Split one class group's rows into contiguous slices sized by ratio.
    proportions = apportion_pcts(ratio, len(group_df))
    partitions = []
    start = 0
    for proportion in proportions:
        s = slice(start, start + proportion)
        partitions.append(group_df.iloc[s, :])
        start += proportion
    return pd.Series(partitions)


partitioned_schema = dd.utils.make_meta(
    [(i, object) for i in range(len(output_ratio))],
    pd.Index([], name='image_class_id'))

partitioned_df = images_df.groupby('image_class_id')
partitioned_df = partitioned_df.apply(
    partition_class, meta=partitioned_schema, ratio=output_ratio)


def zip_partitions(partitions_df):
    # Concatenate each column's per-group DataFrames into a single
    # DataFrame, yielding one combined DataFrame per output split.
    partitions = []
    for i in partitions_df.columns:
        partitions.append(pd.concat(partitions_df[i].tolist()))
    return pd.Series(partitions)


zipped_schema = dd.utils.make_meta((None, object))

partitioned_ds = partitioned_df.reduction(
    chunk=zip_partitions, meta=zipped_schema)
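
Computing the result then yields a plain pandas Series whose two elements should be the concatenated splits (train_df and test_df are illustrative names, not from the original code):

# Hypothetical usage: materialize the two splits as pandas DataFrames.
train_df, test_df = partitioned_ds.compute()
print(len(train_df), len(test_df))  # roughly an 80/20 row split per class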

I think it should be possible to combine the apply and the reduction into a single custom aggregation representing a map-reduce operation.

However, I could not figure out how to do this with a custom aggregation, since it operates on a SeriesGroupBy.
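
For reference, this is roughly the shape of the dd.Aggregation interface (a toy custom mean adapted from the Dask docs); its chunk and agg callables receive grouped Series, which is why returning a Series of DataFrames per group does not fit:

# Toy dd.Aggregation showing the interface: chunk runs on each partition's
# SeriesGroupBy, agg combines the per-partition results, and finalize
# produces the final value per group.
custom_mean = dd.Aggregation(
    'custom_mean',
    chunk=lambda s: (s.count(), s.sum()),
    agg=lambda count, total: (count.sum(), total.sum()),
    finalize=lambda count, total: total / count,
)
# usage (illustrative): images_df.groupby('image_class_id')['image_id'].agg(custom_mean)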

