
How to get a single row value from a Dask dataframe read from parquet files?

Problem: calling

loc[concrete_row, concrete_column] 

on a Dask dataframe returns a pandas dataframe with multiple rows, all sharing the same index:

0                   [1,2,3]
0                   [1,2]
0                   [3]

instead of a single row value:

0                   [1,2,3]

I am reading many parquet files:

dd.read_parquet(dataset_dir+'/train/date*/*.parquet')

Each row in the parquet files contains an array.

  • It seems that when a specific row is requested, the Dask dataframe returns the value at that row's index from every partition.
  • When reading from parquet files, all divisions are None.
  • I tried set_index to set the divisions, but it becomes too slow.

I need to call a map function for each row and get the iterable value of that specific row. How do I resolve this?

I need to call a map function for each row and get the iterable value of that specific row.

It sounds like you want the map or apply methods.

def func(row):
    return ...

result = df.apply(func, axis=1)  # axis=1 applies func to each row

In general, parallel computing tools like Dask are poorly suited to fetching data one row at a time. Instead, it's common to apply a function across all of your rows in parallel.
