
How to collocate large datasets most efficiently, matching on time, latitude and longitude

I would like some help trying to efficiently collocate two datasets. One is, say, observations of rainfall, given in terms of datetime, latitude and longitude. The other is meteorological data, e.g. reanalysis output, also given in terms of datetime, latitude and longitude. Below I create an example random DataFrame and an example xarray Dataset, and then collocate them.

import pandas as pd
from numpy.random import rand
from random import randint
from datetime import datetime
import xarray as xr
import numpy as np

#create example data: the dataframe of observations we want to collocate with the meteorological data

datetimes = pd.date_range(start='2002-01-01 10:00:00', end='2002-01-05 10:00:00', freq='H')
rainfall = rand(len(datetimes))
latitudes = [randint(0, 90) for p in range(0, len(datetimes))]
longitudes = [randint(0, 180) for p in range(0, len(datetimes))]
df_obs = pd.DataFrame({'datetime':datetimes, 'rainfall':rainfall, 'latitude':latitudes,
                       'longitude':longitudes})

#create an xarray which is the example met data

met_type = np.ones((720, 1440))
rainfall = rand(len(datetimes))
met_list = [x*met_type for x in rainfall]

def produce_xarray(met_list, datetimes, met_type='rain', datetime_var="datetime"):
    #accept either datetime objects or 'YYYYMM' strings
    if isinstance(datetimes[0], datetime):
        dates = datetimes
    else:
        dates = [datetime.strptime(x, '%Y%m') for x in datetimes]
    met_list_dstack = np.dstack(met_list)
    lats = np.arange(90, -90, -0.25)
    lons = np.arange(-180,180, 0.25)
    ds = xr.Dataset(data_vars={met_type:(["latitude","longitude",datetime_var], met_list_dstack),}, 
                    coords={"latitude": lats, "longitude": lons, datetime_var: dates})
    ds[met_type].attrs["units"] = "g "+str(met_type)+"m$^{-2}$"
    return ds

xr_met = produce_xarray(met_list, datetimes, datetime_var="datetime")
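
A quick sanity check, assuming the defaults above (met_type='rain'): hourly timestamps from 2002-01-01 10:00 to 2002-01-05 10:00 give 97 time slices, each on a 720 x 1440 (0.25 degree) grid.

#sanity check: 97 hourly timestamps stacked on a 720 x 1440 grid
assert len(datetimes) == 97
assert xr_met['rain'].shape == (720, 1440, len(datetimes))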

#now I wish to collocate the data as quickly as possible, as my datasets are huge -
#here I have a function which finds the closest value using the datetime, latitude and longitude,
#then I apply this function to the df of my random observations

var = 'rain'

def find_value_lat_lon(lat, lon, traj_datetime):
    array = xr_met[var].sel(latitude=lat, longitude=lon, datetime=traj_datetime, method='nearest').squeeze()
    value = array.values
    return value
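
For illustration, a single lookup with arbitrary coordinates (not from my real data) looks like this:

#one-off nearest-neighbour lookup at an arbitrary point and time
value = find_value_lat_lon(45.0, 10.0, pd.Timestamp('2002-01-02 12:00:00'))
print(value)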

def append_var_columnwise(df, var_name):
    df = df.copy()
    df.loc[:, var_name] = df[['latitude', 'longitude', 'datetime']].apply(
        lambda x: find_value_lat_lon(*x), axis=1)
    return df

print(df_obs)

print(xr_met)

df_obs = append_var_columnwise(df_obs, var_name='rain_met')

print(df_obs)

The final output is that df_obs gains an additional column, 'rain_met', holding the nearest met value for each observation - for 97 data points this takes 212 ms.
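
To reproduce that timing outside IPython (where %timeit would do the job), a simple sketch:

import time

start = time.perf_counter()
df_obs = append_var_columnwise(df_obs, var_name='rain_met')
elapsed_ms = (time.perf_counter() - start) * 1000
print(f"collocated {len(df_obs)} rows in {elapsed_ms:.0f} ms")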

I don't know that it is any faster, but .sel supports vectorized indexing (see https://docs.xarray.dev/en/stable/user-guide/indexing.html#vectorized-indexing - the last example in that section is a 2D version of your code):

df.loc[:, var_name] = xr_met[var].sel(
    latitude=xr.DataArray(df['latitude']),
    longitude=xr.DataArray(df['longitude']),
    datetime=xr.DataArray(df['datetime']),
    method='nearest')
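
Because all three indexer DataArrays share the same dimension (inherited from the DataFrame's index), .sel selects pointwise and returns one value per row in a single call, instead of len(df) separate lookups. A self-contained version of the same idea (a sketch; the function name and the explicit 'points' dimension are my own choices, and .values strips the xarray metadata before assigning back to pandas):

def append_var_vectorized(df, var_name):
    df = df.copy()
    #all three indexers share the 'points' dimension, so .sel performs a
    #single pointwise (vectorized) nearest-neighbour selection
    matched = xr_met[var].sel(
        latitude=xr.DataArray(df['latitude'].values, dims='points'),
        longitude=xr.DataArray(df['longitude'].values, dims='points'),
        datetime=xr.DataArray(df['datetime'].values, dims='points'),
        method='nearest')
    df.loc[:, var_name] = matched.values
    return df

df_obs = append_var_vectorized(df_obs, var_name='rain_met')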
