
Read .nc files from Azure Datalake Gen2 in Azure Databricks

I am trying to read .nc (netCDF4) files in Azure Databricks.

I have never worked with .nc files before.

  1. All the required .nc files are in Azure Datalake Gen2.
  2. I mounted the above files to "/mnt/eco_dailyRain".
  3. I can list the contents of the mount using dbutils.fs.ls("/mnt/eco_dailyRain"). Output:

     Out[76]: [FileInfo(path='dbfs:/mnt/eco_dailyRain/2000.daily_rain.nc', name='2000.daily_rain.nc', size=429390127), FileInfo(path='dbfs:/mnt/eco_dailyRain/2001.daily_rain.nc', name='2001.daily_rain.nc', size=428217143), FileInfo(path='dbfs:/mnt/eco_dailyRain/2002.daily_rain.nc', name='2002.daily_rain.nc', size=428218181), FileInfo(path='dbfs:/mnt/eco_dailyRain/2003.daily_rain.nc', name='2003.daily_rain.nc', size=428217139), FileInfo(path='dbfs:/mnt/eco_dailyRain/2004.daily_rain.nc', name='2004.daily_rain.nc', size=429390143), FileInfo(path='dbfs:/mnt/eco_dailyRain/2005.daily_rain.nc', name='2005.daily_rain.nc', size=428217137), FileInfo(path='dbfs:/mnt/eco_dailyRain/2006.daily_rain.nc', name='2006.daily_rain.nc', size=428217127), FileInfo(path='dbfs:/mnt/eco_dailyRain/2007.daily_rain.nc', name='2007.daily_rain.nc', size=428217143), FileInfo(path='dbfs:/mnt/eco_dailyRain/2008.daily_rain.nc', name='2008.daily_rain.nc', size=429390137), FileInfo(path='dbfs:/mnt/eco_dailyRain/2009.daily_rain.nc', name='2009.daily_rain.nc', size=428217127), FileInfo(path='dbfs:/mnt/eco_dailyRain/2010.daily_rain.nc', name='2010.daily_rain.nc', size=428217134), FileInfo(path='dbfs:/mnt/eco_dailyRain/2011.daily_rain.nc', name='2011.daily_rain.nc', size=428218181), FileInfo(path='dbfs:/mnt/eco_dailyRain/2012.daily_rain.nc', name='2012.daily_rain.nc', size=429390127), FileInfo(path='dbfs:/mnt/eco_dailyRain/2013.daily_rain.nc', name='2013.daily_rain.nc', size=428217143), FileInfo(path='dbfs:/mnt/eco_dailyRain/2014.daily_rain.nc', name='2014.daily_rain.nc', size=428218104), FileInfo(path='dbfs:/mnt/eco_dailyRain/2015.daily_rain.nc', name='2015.daily_rain.nc', size=428217134), FileInfo(path='dbfs:/mnt/eco_dailyRain/2016.daily_rain.nc', name='2016.daily_rain.nc', size=429390127), FileInfo(path='dbfs:/mnt/eco_dailyRain/2017.daily_rain.nc', name='2017.daily_rain.nc', size=428217223), FileInfo(path='dbfs:/mnt/eco_dailyRain/2018.daily_rain.nc', name='2018.daily_rain.nc', size=418143765), 
FileInfo(path='dbfs:/mnt/eco_dailyRain/2019.daily_rain.nc', name='2019.daily_rain.nc', size=370034113), FileInfo(path='dbfs:/mnt/eco_dailyRain/Consignments.parquet', name='Consignments.parquet', size=237709917), FileInfo(path='dbfs:/mnt/eco_dailyRain/test.nc', name='test.nc', size=428217137)]

Just to test whether I can read from the mount:

spark.read.parquet('dbfs:/mnt/eco_dailyRain/Consignments.parquet')

This confirms that the Parquet file can be read.

Output:

Out[83]: DataFrame[CONSIGNMENT_PK: int, CERTIFICATE_NO: string, ACTOR_NAME: string, GENERATOR_FK: int, TRANSPORTER_FK: int, RECEIVER_FK: int, REC_POST_CODE: string, WASTEDESC: string, WASTE_FK: int, GEN_LICNUM: string, VOLUME: int, MEASURE: string, WASTE_TYPE: string, WASTE_ADD: string, CONTAMINENT1_FK: int, CONTAMINENT2_FK: int, CONTAMINENT3_FK: int, CONTAMINENT4_FK: int, TREATMENT_FK: int, ANZSICODE_FK: int, VEH1_REGNO: string, VEH1_LICNO: string, VEH2_REGNO: string, VEH2_LICNO: string, GEN_SIGNEE: string, GEN_DATE: timestamp, TRANS_SIGNEE: string, TRANS_DATE: timestamp, REC_SIGNEE: string, REC_DATE: timestamp, DATECREATED: timestamp, DISCREPANCY: string, APPROVAL_NUMBER: string, TR_TYPE: string, REC_WASTE_FK: int, REC_WASTE_TYPE: string, REC_VOLUME: int, REC_MEASURE: string, DATE_RECEIVED: timestamp, DATE_SCANNED: timestamp, HAS_IMAGE: string, LASTMODIFIED: timestamp]

But trying to read a netCDF4 file fails with "No such file or directory".

Code:

import datetime as dt  # Python standard library datetime  module
import numpy as np
from netCDF4 import Dataset  # http://code.google.com/p/netcdf4-python/
import matplotlib.pyplot as plt

rootgrp = Dataset("dbfs:/mnt/eco_dailyRain/2001.daily_rain.nc","r", format="NETCDF4")

Error:

FileNotFoundError: [Errno 2] No such file or directory: b'dbfs:/mnt/eco_dailyRain/2001.daily_rain.nc'

Any clues?

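According to the API reference for the Dataset class of the netCDF4 module, the value of the path parameter of Dataset should be a path in Unix directory format. The path dbfs:/mnt/eco_dailyRain/2001.daily_rain.nc in your code is, as far as I know, a PySpark-format path, so you got the error FileNotFoundError: [Errno 2] No such file or directory: b'dbfs:/mnt/eco_dailyRain/2001.daily_rain.nc'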

The fix is to change the path value dbfs:/mnt/eco_dailyRain/2001.daily_rain.nc to the equivalent Unix path /dbfs/mnt/eco_dailyRain/2001.daily_rain.nc, as in the code below.

rootgrp = Dataset("/dbfs/mnt/eco_dailyRain/2001.daily_rain.nc","r", format="NETCDF4")
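When several files need the same treatment, the prefix change can be wrapped in a small helper. The following is only a sketch: the function name dbfs_uri_to_local_path is hypothetical (it is not part of Databricks or netCDF4), and it assumes the cluster exposes DBFS on the driver's local file system under /dbfs.

```python
def dbfs_uri_to_local_path(path: str) -> str:
    """Map a 'dbfs:/...' URI to its '/dbfs/...' local path.

    Hypothetical helper: assumes DBFS is mounted locally under /dbfs.
    Paths that do not start with 'dbfs:/' are returned unchanged.
    """
    prefix = "dbfs:/"
    if path.startswith(prefix):
        # Drop the URI scheme and any extra leading slashes, then re-root under /dbfs.
        return "/dbfs/" + path[len(prefix):].lstrip("/")
    return path


local_path = dbfs_uri_to_local_path("dbfs:/mnt/eco_dailyRain/2001.daily_rain.nc")
print(local_path)  # /dbfs/mnt/eco_dailyRain/2001.daily_rain.nc
```

The returned string can then be passed to Dataset in place of the dbfs:/ path.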

You can verify that the files are visible at that path with the code below.

%sh
ls /dbfs/mnt/eco_dailyRain

Of course, since you have mounted it, you can also list your netCDF4 data files via dbutils.fs.ls('/mnt/eco_dailyRain').
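The same check can also be done from plain Python, which sees the local file system in the same way that netCDF4 does. This is a rough sketch: the function name list_nc_files is hypothetical, and the directory is assumed to be the /dbfs view of the mount from the question.

```python
from pathlib import Path
from typing import List


def list_nc_files(directory: str) -> List[str]:
    """Return the sorted names of the .nc files directly inside `directory`."""
    return sorted(p.name for p in Path(directory).glob("*.nc"))


# Assumed local view of the mount; only present on a cluster where /dbfs is available.
mount_dir = "/dbfs/mnt/eco_dailyRain"
if Path(mount_dir).is_dir():
    print(list_nc_files(mount_dir))
else:
    print(f"{mount_dir} is not visible on the local file system")
```

If the directory is not visible here, Dataset will fail with the same FileNotFoundError, whatever dbutils.fs.ls reports.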

