
Seeking explanation to Dask vs Numpy vs Pandas benchmark results

I am trying to benchmark the performance of Dask vs. Pandas vs. NumPy.

import numpy as np
import pandas as pd
import dask.array as da
import perfplot

def make_pandas(n):
    df = pd.DataFrame(np.random.randint(10, size=(n, 3)))
    return df
    return df

def make_dask(n):
    df = da.from_array(np.random.randint(10, size=(n, 3)), chunks=10)
    return df

def make_numpy(n):
    return np.random.randint(10, size=(n, 3))

def sum_pandas(x): return x[0].sum()
def sum_dask(x): return x[1].sum()
def sum_numpy(x): return x[2].sum()

perfplot.show(
    setup=lambda n: [make_pandas(n), make_dask(n), make_numpy(n)],
    kernels=[sum_pandas, sum_dask, sum_numpy],
    n_range=[2**k for k in range(2, 15)],
    equality_check=False,
    xlabel='len(df)')
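
One caveat worth checking with a benchmark written this way (my observation, not part of the original post): calling `.sum()` on a dask array is lazy and only builds a task graph, so the `sum_dask` kernel above may largely be timing graph construction rather than the reduction itself. A minimal sketch of the distinction:

```python
import numpy as np
import dask.array as da

x = da.from_array(np.random.randint(10, size=(1000, 3)), chunks=100)

# Lazy: returns a dask scalar and builds a task graph; no work runs yet.
lazy = x.sum()

# .compute() triggers the actual reduction and returns a concrete number.
result = int(lazy.compute())

# The computed dask sum matches the plain NumPy sum of the same data.
print(result == int(x.compute().sum()))
```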

Can someone explain the results:

[Plot: dask vs numpy vs pandas, chunks=10]

Increasing `chunks` to 1000, 8000, and 10000 gives these, respectively:

[Plot: dask vs numpy vs pandas, chunks=1000]

[Plot: dask vs numpy vs pandas, chunks=8000]

[Plot: dask vs numpy vs pandas, chunks=10000]

  • Processor: Intel® Core™ i5-7300HQ CPU @ 2.50GHz × 4
  • Memory: 7.7 GiB
  • Python: 3.5.2
  • pandas: 0.21.0
  • numpy: 1.13.1
  • dask: 0.19.0

Isn't Dask supposed to parallelize the work and get faster as the size increases?

The `chunks` keyword is short for "chunk size", not the number of chunks. With `chunks=10`, each block holds at most 10 rows along each axis, so a large array is split into thousands of tiny blocks and scheduler overhead dominates the computation.
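
A small sketch illustrating this (the array shape mirrors the benchmark's `size=(n, 3)`):

```python
import numpy as np
import dask.array as da

arr = np.random.randint(10, size=(8192, 3))

# chunks=10 means "blocks of at most 10 elements per axis", not
# "10 blocks": the 8192-row array is split into 820 tiny blocks.
small = da.from_array(arr, chunks=10)
print(small.numblocks)   # (820, 1)

# A chunk size on the order of the array itself gives a single block,
# so per-task overhead no longer dominates the reduction.
big = da.from_array(arr, chunks=8192)
print(big.numblocks)     # (1, 1)

# Both chunkings compute the same sum, only the task count differs.
print(int(small.sum().compute()) == int(big.sum().compute()))  # True
```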
