Add random numbers to dask dataframe using data from the dataframe to set limits
I would like to add a column of random numbers to a dask dataframe, using the column intensity of the original dataframe to set the limits of the random numbers for each row. The code works with pandas and numpy.random, but not with dask and dask.array.
import dask.array as da
import dask.dataframe as dd
from dask.distributed import Client
client = Client()
fns = [list-of-filenames]
df = dd.read_parquet(fns)
# dataframe has a column called intensity of type float
# and no missing values
df['separation_dimension_1'] = da.random.uniform(size=N, low=-noise_level/df.intensity, high=noise_level/df.intensity)
The error is:
ValueError: shape mismatch: objects cannot be broadcast to a single shape. Mismatch is between arg 0 with shape (0,) and arg 1 with shape (33276691,).
It seems the syntax of numpy.random.uniform is a bit different from that of dask.array.random.uniform? The full traceback is:
Cell In[21], line 7
5 df['mz_'] = df.mz * 1000000000
6 df['rt_'] = df.scan_time*10
----> 7 df['separation_dimension_1'] = da.random.uniform(size=N, low=-noise_level/df.intensity, high=noise_level/df.intensity)
8 #df['separation_dimension_2'] = da.random.uniform(size=N, low=-noise_level/df.intensity, high=noise_level/df.intensity)
9 #df['separation_dimension_3'] = da.random.uniform(size=N, low=-noise_level/df.intensity, high=noise_level/df.intensity)
11 df = df[df.intensity > 1e5][['rt_', 'mz_', 'logint']]
File ~/miniconda3/envs/dask/lib/python3.9/site-packages/dask/array/random.py:465, in _make_api.<locals>.wrapper(*args, **kwargs)
462 if backend not in _cached_random_states:
463 # Cache the default RandomState object for this backend
464 _cached_random_states[backend] = RandomState()
--> 465 return getattr(
466 _cached_random_states[backend],
467 attr,
468 )(*args, **kwargs)
File ~/miniconda3/envs/dask/lib/python3.9/site-packages/dask/array/random.py:423, in RandomState.uniform(self, low, high, size, chunks, **kwargs)
421 @derived_from(np.random.RandomState, skipblocks=1)
422 def uniform(self, low=0.0, high=1.0, size=None, chunks="auto", **kwargs):
--> 423 return self._wrap("uniform", low, high, size=size, chunks=chunks, **kwargs)
File ~/miniconda3/envs/dask/lib/python3.9/site-packages/dask/array/random.py:170, in RandomState._wrap(self, funcname, size, chunks, extra_chunks, *args, **kwargs)
165 kwrg[k] = (getitem, lookup[k], slc)
166 vals.append(
167 (_apply_random, self._RandomState, funcname, seed, size, arg, kwrg)
168 )
--> 170 meta = _apply_random(
171 self._RandomState,
172 funcname,
173 seed,
174 (0,) * len(size),
175 small_args,
176 small_kwargs,
177 )
179 dsk.update(dict(zip(keys, vals)))
181 graph = HighLevelGraph.from_collections(name, dsk, dependencies=dependencies)
File ~/miniconda3/envs/dask/lib/python3.9/site-packages/dask/array/random.py:453, in _apply_random(RandomState, funcname, state_data, size, args, kwargs)
451 state = RandomState(state_data)
452 func = getattr(state, funcname)
--> 453 return func(*args, size=size, **kwargs)
File mtrand.pyx:1134, in numpy.random.mtrand.RandomState.uniform()
File _common.pyx:600, in numpy.random._common.cont()
File _common.pyx:517, in numpy.random._common.cont_broadcast_2()
File __init__.pxd:741, in numpy.PyArray_MultiIterNew3()
ValueError: shape mismatch: objects cannot be broadcast to a single shape. Mismatch is between arg 0 with shape (0,) and arg 1 with shape (6249365,).
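Judging from the traceback, the failure seems to happen while dask builds the metadata for the result: the call passes a zero-length size to numpy's uniform while low and high still have their full length. The following is a minimal numpy-only sketch of that kind of mismatch, not a reproduction of dask's internals; the array values are made up for illustration.

```python
import numpy as np

rng = np.random.RandomState(0)

# Hypothetical per-row bounds, standing in for noise_level / df.intensity
bounds = 1.0 / np.array([1.0, 2.0, 4.0])

# size matches the length of low/high: one draw per row
ok = rng.uniform(low=-bounds, high=bounds, size=3)

# Zero-length size combined with length-3 bounds cannot be broadcast
try:
    rng.uniform(low=-bounds, high=bounds, size=(0,))
    failed = False
except ValueError as exc:
    failed = True
    message = str(exc)
```

If this reading is right, the problem is less about syntax and more about size and the array-valued bounds having to agree in shape at every call.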
As is often the case, you can do this with map_partitions, which applies the operation you want to each partition of the dask dataframe. Each partition is a real pandas dataframe, so the numpy version of the code works inside the function.
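import numpy as np
def op(df):
    # df is a single partition (a pandas dataframe), so size must be its own length, not the global N
    df = df.copy()
    df['separation_dimension_1'] = np.random.uniform(size=len(df), low=-noise_level/df.intensity, high=noise_level/df.intensity)
    return df
df2 = df.map_partitions(op)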
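A slightly fuller sketch of the same idea, written so that the per-partition function can be checked on a plain pandas dataframe before it is handed to map_partitions. The names add_noise and run_with_dask, the seed parameter, and the sample data are assumptions made for illustration; they are not part of the original code.

```python
import numpy as np
import pandas as pd


def add_noise(part, noise_level=1.0, column="separation_dimension_1", seed=None):
    """Return a copy of one partition with uniform noise in +/- noise_level / intensity."""
    part = part.copy()  # avoid mutating the partition that was passed in
    rng = np.random.default_rng(seed)
    bound = noise_level / part["intensity"].to_numpy()
    # One draw per row of this partition; bounds broadcast elementwise
    part[column] = rng.uniform(low=-bound, high=bound, size=len(part))
    return part


def run_with_dask(pdf, noise_level, npartitions=2):
    """Apply add_noise to every partition of a dask dataframe built from pdf."""
    import dask.dataframe as dd  # imported lazily; only needed for this helper

    ddf = dd.from_pandas(pdf, npartitions=npartitions)
    # Extra keyword arguments are forwarded to add_noise for each partition
    return ddf.map_partitions(add_noise, noise_level=noise_level).compute()


# Quick check on plain pandas (made-up intensities)
sample = pd.DataFrame({"intensity": [1e5, 2e5, 5e5, 1e6]})
noisy = add_noise(sample, noise_level=1e3, seed=0)
```

The function also has to cope with an empty dataframe, since dask may call it on one to infer the output columns; using size=len(part) keeps that case consistent.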