
Efficient method to extract data from netCDF files, with Xarray, into a tall DataFrame

I have a list of about 350 coordinates, all within a specified area, whose values I want to extract from a set of netCDF files using Xarray. In case it is relevant, I am trying to extract SWE (snow water equivalent) data from a particular land surface model.

My problem is that the for loop below takes forever to go through each item in the list and get the relevant timeseries data. Perhaps this is unavoidable to some extent, since I have to actually load the data from the netCDF files for each coordinate. What I need help with is speeding up the code in any way possible. Right now it has been running for 3+ hours and counting.

Here is everything I have done so far:

import xarray as xr
import numpy as np
import pandas as pd
from datetime import datetime as dt

1) First, open all of the files (daily data from 1915-2011).

df = xr.open_mfdataset(r'C:\temp\*.nc',combine='by_coords')

2) Narrow my location to a smaller box within the continental United States

swe_sub = df.swe.sel(lon=slice(246.695, 251), lat=slice(33.189, 35.666))

3) I just want to extract the first daily value for each month, which also narrows the timeseries.

swe_first = swe_sub.sel(time=swe_sub.time.dt.day == 1)

Now I want to load my list of coordinates (which happens to be in an Excel file).

coord = pd.read_excel(r'C:\Documents\Coordinate_List.xlsx')
print(coord)
lat = coord['Lat']
lon = coord['Lon']
lon = 360+lon
name = coord['OBJECTID']
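The `360+lon` step above shifts negative (western hemisphere) longitudes onto the 0 to 360 grid used by the model files. As a small sketch with made-up longitude values, the modulo operator does the same shift but also leaves longitudes that are already positive untouched, which plain addition would push past 360:

```python
import numpy as np

# Made-up longitudes in the -180..180 convention (not from the real Excel file).
lon_west = np.array([-113.305, -109.0, 10.0])

# Plain addition only works when every value is negative.
lon_added = 360 + lon_west

# Modulo maps negative values into 0..360 and leaves positive values as they are.
lon_360 = lon_west % 360

print(lon_added)
print(lon_360)
```

For a coordinate list that lies entirely in the western hemisphere, as here, both forms give the same result.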

The following for loop goes through each coordinate in my list of coordinates, extracts the timeseries at each coordinate, and rolls it into a tall DataFrame.

frames = []
for i,j,k in zip(lat,lon,name):
    dsloc = swe_first.sel(lat=i,lon=j,method='nearest')
    DT=dsloc.to_dataframe()

    # Insert the name of the station with preferred column title:
    DT.insert(loc=0,column="Station",value=k)
    frames.append(DT)

# DataFrame.append was removed in pandas 2.0, so concatenate once at the end:
Newdf=pd.concat(frames,sort=True)

I would greatly appreciate any help or advice y'all can offer!

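Alright, I figured this one out. It turns out I needed to load my subset of data into memory first, since Xarray "lazy loads" the data into the Dataset by default. Without that step, each selection inside the loop apparently had to go back to the files on disk.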

Here is the line of code that I revised to make this work properly:

swe_first = swe_sub.sel(time=swe_sub.time.dt.day == 1).persist()
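With the subset in memory, the per-coordinate loop can also be replaced by a single vectorized selection. The following is only a sketch on synthetic data: the grid values, the two stations, and the names `lat_idx`, `lon_idx`, `points`, and `tall` are made up for illustration, and only the column names `Lat`, `Lon`, and `OBJECTID` come from the question. It relies on Xarray's vectorized indexing, where indexers that share a dimension (here `station`) select one grid cell per station rather than every lat/lon combination.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Synthetic stand-in for the in-memory SWE subset (all values are made up).
time = pd.date_range("2000-01-01", periods=3, freq="MS")
lat = np.array([33.25, 33.75, 34.25, 34.75])
lon = np.array([247.0, 248.0, 249.0, 250.0])
data = np.arange(time.size * lat.size * lon.size, dtype=float).reshape(
    time.size, lat.size, lon.size
)
swe_first = xr.DataArray(
    data,
    coords={"time": time, "lat": lat, "lon": lon},
    dims=("time", "lat", "lon"),
    name="swe",
)

# Hypothetical station list mirroring the Lat/Lon/OBJECTID columns.
coord = pd.DataFrame(
    {"OBJECTID": [101, 102], "Lat": [33.3, 34.7], "Lon": [-112.1, -110.2]}
)

# Both indexers share the new "station" dimension, so xarray picks one
# nearest grid cell per station in a single call.
lat_idx = xr.DataArray(coord["Lat"].to_numpy(), dims="station")
lon_idx = xr.DataArray(360 + coord["Lon"].to_numpy(), dims="station")
points = swe_first.sel(lat=lat_idx, lon=lon_idx, method="nearest")

# Label the station dimension with the station IDs.
points = points.assign_coords(station=coord["OBJECTID"].to_numpy())

# One row per (time, station) pair, i.e. a tall DataFrame.
tall = points.to_dataframe().reset_index()
print(tall)
```

The resulting frame has one row per time step and station, with the matched grid `lat` and `lon` carried along as columns, so it can be checked against the original coordinate list.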

Here is a link I found helpful for this issue:

https://examples.dask.org/xarray.html

I hope this helps someone else out too!
