
How to get a single row value from a Dask dataframe read from parquet files?

Problem: calling

loc[concrete_row, concrete_column] 

on a Dask dataframe returns a pandas dataframe with multiple rows, all sharing the same index:

0                   [1,2,3]
0                   [1,2]
0                   [3]

instead of a single row value:

0                   [1,2,3]

I am reading many parquet files:

dd.read_parquet(dataset_dir+'/train/date*/*.parquet')

Each row in the parquet files contains an array.

  • It seems that when a specific row is requested, the Dask dataframe returns the value at that row's index from every partition.
  • When reading from parquet files, all divisions are None.
  • I tried set_index to set the divisions, but it becomes too slow.

I need to call a map function for each row and get the iterable value of that specific row. How do I resolve this?

I need to call a map function for each row and get the iterable value of that specific row.

It sounds like you want the map or apply methods.

def func(row):
    return ...

result = df.apply(func, axis=1)  # axis=1 applies func to each row

In general, parallel computing tools like Dask are poorly suited to fetching data one row at a time. Instead, it's common to apply a function across all of your rows in parallel.
