
Calculate xarray DataArray from coordinate labels

I have a DataArray with two variables (meteorological data) over time, y, x coordinates. The x and y coordinates are in a projected coordinate system (EPSG:3035) and aligned so that each cell covers almost exactly a standard cell of the 1 km LAEA reference grid.

I want to prepare the data for further use in Pandas and/or database tables, so I want to add the LAEA grid cell number/label, which can be calculated from x and y directly via the following (pseudo) function:

def func(cell):
    # build the 1 km LAEA grid cell label from the projected coordinates
    return r'1kmN{}E{}'.format(int(cell['y'] / 1000), int(cell['x'] / 1000))  # e.g. 1kmN2782E4850

But as far as I can see, there seems to be no way to apply this function to a DataArray or Dataset so that I have access to these coordinate variables (at least .apply_ufunc() wasn't really working for me).

I am able to calculate this in Pandas later on, but some of my datasets consist of 60 to 120 million cells/rows, and Pandas (even with Numba) seems to have trouble with that amount. With xarray I am able to process this on 32 cores via Dask.
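(For illustration, a minimal sketch of how such a dataset could be opened in chunks on a local Dask cluster; the file name, variable names and chunk sizes below are placeholders, not my actual setup:)

import xarray as xr
from dask.distributed import Client

# local cluster, one worker per core; 32 is just an example
client = Client(n_workers=32)

# placeholder file and chunk sizes, purely for illustration
ds = xr.open_dataset("vci.nc", chunks={"y": 1000, "x": 1000})

print(ds)  # the data variables are now lazy dask arrays, computed chunk by chunk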

I would be grateful for any advice on how to get this working.

EDIT: Some more insights into the data I'm working with:

This one is by far the largest with 500 million cells, but I am able to downsample it to square-kilometre resolution, which ends up with about 160 million cells.

Xarray "vci" with several decades of monthly Vegetation Condition Index data

If the dataset is small enough, I am able to export it as a Pandas dataframe and do the calculation there, but that is slow and not very robust, as the kernel crashes quite often.

The same calculation in Pandas
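(The screenshot is not preserved; as a rough sketch only, with column names assumed, the Pandas-side calculation could look something like this:)

import pandas as pd

# small synthetic frame standing in for ds.to_dataframe().reset_index()
df = pd.DataFrame({
    "y": [2_782_500.0, 2_783_500.0],
    "x": [4_850_500.0, 4_850_500.0],
    "vci": [0.42, 0.57],
})

# vectorised string construction of the 1 km LAEA label, e.g. 1kmN2782E4850
df["gridcell"] = (
    "1kmN" + (df["y"] // 1000).astype(int).astype(str)
    + "E" + (df["x"] // 1000).astype(int).astype(str)
)
print(df)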

This is how you can apply your function:

import xarray as xr

# element-wise (scalar) function; apply_ufunc will vectorize it for us
def func(x, y):
    return r'1km{}{}'.format(int(y), int(x))

# test data
ds = xr.tutorial.load_dataset("rasm")

xr.apply_ufunc(
    func,
    ds.x,
    ds.y,
    vectorize=True,
)

Note that you don't have to list input_core_dims in your case.
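As a self-contained follow-up sketch (using small, made-up projected coordinates instead of the rasm test data), this is how the resulting labels could be attached back to the Dataset as an extra coordinate and then flattened to Pandas:

import numpy as np
import xarray as xr

# synthetic 1 km grid cell centres in projected metres (values are made up)
x = np.arange(4_850_500, 4_853_500, 1_000)
y = np.arange(2_782_500, 2_785_500, 1_000)
ds = xr.Dataset(
    {"vci": (("y", "x"), np.random.rand(y.size, x.size))},
    coords={"x": x, "y": y},
)

def func(x, y):
    return f"1kmN{int(y / 1000)}E{int(x / 1000)}"   # e.g. 1kmN2782E4850

# np.vectorize infers a fixed string length from the first result,
# which is fine here because all labels have the same length
labels = xr.apply_ufunc(func, ds.x, ds.y, vectorize=True)

# attach the labels as a non-dimension coordinate, then flatten to pandas
ds = ds.assign_coords(gridcell=labels)
df = ds.to_dataframe()
print(df.head())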

Also, since your function isn't vectorized, you need to set vectorize=True:

vectorize : bool, optional — If True, then assume func only takes arrays defined over core dimensions as input and vectorize it automatically with numpy.vectorize. This option exists for convenience, but is almost always slower than supplying a pre-vectorized function. Using this option requires NumPy version 1.12 or newer.

Using vectorize=True might not be the most performant option, as it is essentially just looping, but if you have your data in chunks and use dask, it might be good enough.
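As a sketch of that combination (synthetic coordinates and arbitrary chunk sizes, purely for illustration), the call could look roughly like this with dask-backed inputs:

import numpy as np
import xarray as xr

def func(x, y):
    return r'1kmN{}E{}'.format(int(y / 1000), int(x / 1000))

# made-up 1 km grid, broadcast to 2-D so every (y, x) cell gets a label
x = xr.DataArray(np.arange(4_850_500, 4_860_500, 1_000), dims="x")
y = xr.DataArray(np.arange(2_782_500, 2_792_500, 1_000), dims="y")
y2d, x2d = xr.broadcast(y, x)                       # both now have dims ("y", "x")

labels = xr.apply_ufunc(
    func,
    x2d.chunk({"y": 5, "x": 5}),                    # dask-backed inputs
    y2d.chunk({"y": 5, "x": 5}),
    vectorize=True,            # still loops element-wise, but chunk by chunk
    dask="parallelized",       # let dask run the chunks in parallel
    output_dtypes=[object],    # labels are Python strings
)

print(labels.compute())        # triggers the parallel computation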

If not, you could look into creating a vectorized function, e.g. with Numba, which would surely speed things up.
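(Numba's string support is limited, so as an alternative illustration rather than the answer's original suggestion, a truly vectorized version could also be built from plain NumPy string operations:)

import numpy as np

def func_vectorized(x, y):
    # build all labels at once with array-level string concatenation
    ykm = (np.asarray(y) // 1000).astype(int).astype(str)
    xkm = (np.asarray(x) // 1000).astype(int).astype(str)
    return np.char.add(np.char.add("1kmN", ykm), np.char.add("E", xkm))

# example: two cells -> ['1kmN2782E4850', '1kmN2783E4850']
print(func_vectorized([4_850_500, 4_850_500], [2_782_500, 2_783_500]))

Such a function could then be passed to apply_ufunc without vectorize=True, since it already handles whole arrays.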

More info can be found in the xarray tutorial on applying ufuncs.

You can use apply_ufunc in an unvectorised way:

def func(x, y):
    return f'1kmN{int(y / 1000)}E{int(x / 1000)}'  # e.g. 1kmN2782E4850

xr.apply_ufunc(
    func,  # first the function
    x.x,   # then the arguments in the order expected by 'func' ('x' here is your Dataset)
    x.y,
)

