Using dask to return more than one dataframe

I am using read_csv() to read a long list of csv files and return two dataframes. I have managed to speed up this action by using dask. Unfortunately, I have not been able to return multiple variables when using dask.

The minimum working example below replicates my issue:

import pandas as pd
import dask.dataframe as dd
from dask import delayed

@delayed(nout=2)
def function(a):
  d = 0
  c = a + a
  if a > 4:  # arbitrary condition so that c and d differ in length
    d = a * a
  return pd.DataFrame([c])#, pd.DataFrame([d])

list = [1,2,3,4,5]

dfs = [delayed(function)(int) for int in list]
ddf = dd.from_delayed(dfs)
ddf.compute()

Any ideas for resolving this issue are appreciated. Thanks.

The delayed decorator has an nout parameter, so something like this might work:

from dask import delayed

@delayed(nout=2)
def function(a,b):
  c = a + b
  d = a * b
  return c, d

delayed_c, delayed_d = function(2, 3)

From the question it's not clear at which step the dataframes come in, but the key part of the question (returning more than one value from dask delayed) is answered by using nout; see this answer for full details.
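To make the nout behaviour concrete, here is a minimal sketch of how the two delayed outputs could be evaluated together. The function name add_and_multiply is only illustrative and does not come from the question.

```python
import dask
from dask import delayed


@delayed(nout=2)
def add_and_multiply(a, b):
    # The function returns a 2-tuple; nout=2 tells dask the length,
    # so the Delayed result can be unpacked before anything runs.
    return a + b, a * b


# Nothing is computed yet: both names refer to lazy Delayed objects.
delayed_sum, delayed_product = add_and_multiply(2, 3)

# dask.compute evaluates both outputs together and returns a tuple.
total, product = dask.compute(delayed_sum, delayed_product)
print(total, product)
```

Passing both outputs to a single dask.compute call lets dask share the underlying task, so the function body should run only once.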

Update:

The delayed function in the updated question returns a tuple of dataframes. This means that dd.from_delayed should either be called separately on each element of the tuple, or the tuples should be unpacked into one flat list:

dfs = [delayed_value for int in list for delayed_value in function(int)]
ddf = dd.from_delayed(dfs)
ddf.compute()
