
How do I insert a numpy ndarray slice as a new Dask DataFrame column?

I'm trying to use the code (provided at the link below) to map lat/long coordinates to NYC boroughs:

https://www.kaggle.com/muonneutrino/nyc-taxis-eda-and-mapping-position-to-borough

I'm working in a low-memory local Jupyter environment, so I've imported the large .csv file of taxi lat/long data into a dask dataframe.

First, I create a dask dataframe with the June 2016 Yellow Cab data and subset it to a single test day to make the set smaller:

import pandas as pd
import numpy as np
import dask.dataframe as dd
import dask.array as da

from dask.distributed import Client
client = Client(processes=False)
%pylab inline

cols = ['pickup_longitude', 'pickup_latitude', 'tpep_pickup_datetime']
ddf = dd.read_csv('yellow_tripdata_2016-06.csv', blocksize=13e7,
                  assume_missing=True, usecols=cols)
ddf['tpep_pickup_datetime'] = dd.to_datetime(ddf.tpep_pickup_datetime, errors='ignore')
ddf['pickup_day'] = ddf.tpep_pickup_datetime.dt.day
td = ddf.loc[ddf.pickup_day == 10]
td = td.rename(columns={'pickup_longitude': 'plon',
                        'pickup_latitude': 'plat'})

Next, I declare the values latmin, lonmin, latmax, and lonmax, compute grid-index columns, and create the numpy array map_tracts:

latmin = 40.48
lonmin = -74.28
latmax = 40.93
lonmax = -73.65
dlat = (latmax - latmin) / 199
dlon = (lonmax - lonmin) / 199
td['lat_idx'] = np.rint((td['plat'] - latmin) / dlat)
td['lon_idx'] = np.rint((td['plon'] - lonmin) / dlon)
map_tracts = np.array([[34023007600, 34023007600, 34023007500, 34031246300,
                        34031246300, 34031246300],
                       [34023007600, 34023007600, 34023007600, 34031246300,
                        34031246300, 34031246300],
                       [34023007600, 34023007600, 34023007600, 34031246300,
                        34031246300, 34031246300],
                       [          0,           0,           0, 36059990200,
                        36119007600, 36119007600],
                       [          0,           0,           0, 36059990200,
                        36059990200, 36119007600]])

I then try to run a dask array where clause:

td['pu_tracts'] = da.where(((latmin < td.plat < latmax) &
                            (lonmin < td.plon < lonmax)),
                            map_tracts[td.lat_idx, td.lon_idx], 0)

But receive an error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-24-5228e3ec653a> in <module>
----> 1 td['pu_tracts'] = da.where(((latmin < td.plat < latmax) &
      2                             (lonmin < td.plon < lonmax)),
      3                             map_tracts[td.lat_idx, td.lon_idx], 0)

~/anaconda3/lib/python3.7/site-packages/dask/dataframe/core.py in __bool__(self)
    441         raise ValueError("The truth value of a {0} is ambiguous. "
    442                          "Use a.any() or a.all()."
--> 443                          .format(self.__class__.__name__))
    444 
    445     __nonzero__ = __bool__  # python 2

ValueError: The truth value of a Series is ambiguous. Use a.any() or a.all().

Is this a dask issue?

UPDATE: after much to-and-fro on OP's code and MCVE, it turns out map_tracts[lon_idx,lat_idx] wasn't even a function call, but either a dask.DataFrame or maybe an np.ndarray (OP: which is it?! Just show us type(map_tracts[lon_idx,lat_idx]) already, please.)

UPDATE2: map_tracts[lon_idx,lat_idx] isn't a dask.DataFrame/Series either; it's a single (numpy) value obtained by slicing into map_tracts (a numpy.ndarray), from which OP then builds an np.ndarray via a list comprehension.
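To see the distinction concretely, here is a minimal sketch using a trimmed copy of the question's map_tracts: indexing a numpy array with two scalars returns a single numpy scalar, while indexing with two index arrays returns an ndarray.

import numpy as np

map_tracts = np.array([[34023007600, 34023007600, 34023007500],
                       [34023007600, 34031246300, 34031246300]])  # trimmed copy

print(type(map_tracts[1, 0]))            # <class 'numpy.int64'> -- a single scalar
print(type(map_tracts[[0, 1], [2, 0]]))  # <class 'numpy.ndarray'> -- fancy indexing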

If you want to return a numpy array to a dask DataFrame, you may need to wrap it as another dask.DataFrame containing a single series (see the dask docs for that).
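Something like this, as a minimal sketch (the array contents and npartitions=1 are placeholders; keeping row order/index alignment with the original frame is on you):

import numpy as np
import pandas as pd
import dask.dataframe as dd

# placeholder: one tract id per row of the frame you want to attach it to
tracts = np.array([36059990200, 36119007600, 0])

# wrap the numpy array in a pandas DataFrame, then lift it into dask
tracts_ddf = dd.from_pandas(pd.DataFrame({'pu_tracts': tracts}),
                            npartitions=1)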


I haven't used dask, but a quick Google for your exception finds the following dask known issue on GitHub (closed, won't-fix):

#4429: Join dask.DataFrame with dask.Series "Could someone please let me know how to join a dask dataframe with a dask series object."

which was closed (won't-fix, presumably) with the recommendation "Try the to_frame method".

Your function get_tract in turn calls map_tracts, which you haven't given code for (is that a third-party library? a numpy call? some code of your own you haven't shown?). Crucially, we can't see whether its return type is dask.Series, dask.DataFrame, numpy.ndarray, pandas.Series, a base Python list, etc. That matters.

Solution: assuming map_tracts() returns a dask.Series, you probably need to wrap it by calling dask.Series.to_frame()
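For example, a toy sketch (the hand-built series here just stands in for whatever map_tracts() actually returns):

import pandas as pd
import dask.dataframe as dd

s = dd.from_pandas(pd.Series([36059990200, 36119007600], name='pu_tracts'),
                   npartitions=1)
frame = s.to_frame()  # a one-column dask DataFrame named 'pu_tracts'

From there, a normal dask merge/concat applies.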

The dask attitude of never fixing these, and not even leaving them open for consideration in a future version, seems pretty weak. You should leave a comment on the issue and try to reopen it (include a link to this SO question), and I suggest also opening a dask doc bug: at minimum their docs need a code sample showing how to do this right; merging a column is fairly basic stuff.
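For what it's worth, here is a sketch of one way to build OP's column without any dask-level join at all, using map_partitions so each pandas partition does plain numpy indexing. Two caveats: the bounding-box test uses explicit elementwise & comparisons, because the chained comparison latmin < td.plat < latmax is exactly what calls bool() on a Series and raises OP's ValueError; and the index clipping (my addition, as is the lookup_tracts helper name) keeps out-of-box rows from indexing past the array edge. It also assumes plat/plon contain no NaNs; the other names come from the question.

import numpy as np

def lookup_tracts(df, tracts, latmin, latmax, lonmin, lonmax, dlat, dlon):
    # rows inside the bounding box; explicit elementwise & instead of a
    # chained comparison, which would call bool() on the Series and raise
    in_box = ((latmin < df['plat']) & (df['plat'] < latmax) &
              (lonmin < df['plon']) & (df['plon'] < lonmax))
    lat_idx = np.rint((df['plat'] - latmin) / dlat).astype(int)
    lon_idx = np.rint((df['plon'] - lonmin) / dlon).astype(int)
    # clip so out-of-box rows still yield a valid (later ignored) index
    lat_idx = lat_idx.clip(0, tracts.shape[0] - 1)
    lon_idx = lon_idx.clip(0, tracts.shape[1] - 1)
    return df.assign(pu_tracts=np.where(in_box, tracts[lat_idx, lon_idx], 0))

td = td.map_partitions(lookup_tracts, map_tracts,
                       latmin, latmax, lonmin, lonmax, dlat, dlon)

Inside lookup_tracts each partition is an ordinary pandas DataFrame, so plain numpy fancy indexing just works and no dask Series-to-DataFrame join is needed.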

(To be honest, Databricks recently (4/2019) launched koalas, a drop-in replacement for pandas that runs on Spark, so I expect a subset of performance-critical Python/pandas users who switched to dask may migrate to Spark/koalas.)
