Read .nc files from Azure Datalake Gen2 in Azure Databricks

Trying to read .nc (netCDF4) files in Azure Databricks.

I have never worked with .nc files before.

  1. All the required .nc files are in Azure Datalake Gen2.
  2. Mounted the above files into Databricks at "/mnt/eco_dailyRain" (a sketch of a typical mount call follows the listing below).
  3. Can list the contents of the mount using dbutils.fs.ls("/mnt/eco_dailyRain"). OUTPUT:

     Out[76]: [FileInfo(path='dbfs:/mnt/eco_dailyRain/2000.daily_rain.nc', name='2000.daily_rain.nc', size=429390127), FileInfo(path='dbfs:/mnt/eco_dailyRain/2001.daily_rain.nc', name='2001.daily_rain.nc', size=428217143), FileInfo(path='dbfs:/mnt/eco_dailyRain/2002.daily_rain.nc', name='2002.daily_rain.nc', size=428218181), FileInfo(path='dbfs:/mnt/eco_dailyRain/2003.daily_rain.nc', name='2003.daily_rain.nc', size=428217139), FileInfo(path='dbfs:/mnt/eco_dailyRain/2004.daily_rain.nc', name='2004.daily_rain.nc', size=429390143), FileInfo(path='dbfs:/mnt/eco_dailyRain/2005.daily_rain.nc', name='2005.daily_rain.nc', size=428217137), FileInfo(path='dbfs:/mnt/eco_dailyRain/2006.daily_rain.nc', name='2006.daily_rain.nc', size=428217127), FileInfo(path='dbfs:/mnt/eco_dailyRain/2007.daily_rain.nc', name='2007.daily_rain.nc', size=428217143), FileInfo(path='dbfs:/mnt/eco_dailyRain/2008.daily_rain.nc', name='2008.daily_rain.nc', size=429390137), FileInfo(path='dbfs:/mnt/eco_dailyRain/2009.daily_rain.nc', name='2009.daily_rain.nc', size=428217127), FileInfo(path='dbfs:/mnt/eco_dailyRain/2010.daily_rain.nc', name='2010.daily_rain.nc', size=428217134), FileInfo(path='dbfs:/mnt/eco_dailyRain/2011.daily_rain.nc', name='2011.daily_rain.nc', size=428218181), FileInfo(path='dbfs:/mnt/eco_dailyRain/2012.daily_rain.nc', name='2012.daily_rain.nc', size=429390127), FileInfo(path='dbfs:/mnt/eco_dailyRain/2013.daily_rain.nc', name='2013.daily_rain.nc', size=428217143), FileInfo(path='dbfs:/mnt/eco_dailyRain/2014.daily_rain.nc', name='2014.daily_rain.nc', size=428218104), FileInfo(path='dbfs:/mnt/eco_dailyRain/2015.daily_rain.nc', name='2015.daily_rain.nc', size=428217134), FileInfo(path='dbfs:/mnt/eco_dailyRain/2016.daily_rain.nc', name='2016.daily_rain.nc', size=429390127), FileInfo(path='dbfs:/mnt/eco_dailyRain/2017.daily_rain.nc', name='2017.daily_rain.nc', size=428217223), FileInfo(path='dbfs:/mnt/eco_dailyRain/2018.daily_rain.nc', name='2018.daily_rain.nc', size=418143765), FileInfo(path='dbfs:/mnt/eco_dailyRain/2019.daily_rain.nc', name='2019.daily_rain.nc', size=370034113), FileInfo(path='dbfs:/mnt/eco_dailyRain/Consignments.parquet', name='Consignments.parquet', size=237709917), FileInfo(path='dbfs:/mnt/eco_dailyRain/test.nc', name='test.nc', size=428217137)]
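For context, such a mount is usually created with dbutils.fs.mount. The sketch below assumes an OAuth service principal; the storage account, container, tenant ID, and secret-scope names are placeholders, not values from this setup.

# Minimal sketch of mounting an ADLS Gen2 container into DBFS (all <...> values are placeholders).
configs = {
  "fs.azure.account.auth.type": "OAuth",
  "fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
  "fs.azure.account.oauth2.client.id": "<application-id>",
  "fs.azure.account.oauth2.client.secret": dbutils.secrets.get(scope="<scope>", key="<key>"),
  "fs.azure.account.oauth2.client.endpoint": "https://login.microsoftonline.com/<tenant-id>/oauth2/token"
}

dbutils.fs.mount(
  source = "abfss://<container>@<storage-account>.dfs.core.windows.net/",
  mount_point = "/mnt/eco_dailyRain",
  extra_configs = configs)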

Just to test whether I can read from the mount:

spark.read.parquet('dbfs:/mnt/eco_dailyRain/Consignments.parquet')

This confirms that the Parquet file can be read.

Output:

Out[83]: DataFrame[CONSIGNMENT_PK: int, CERTIFICATE_NO: string, ACTOR_NAME: string, GENERATOR_FK: int, TRANSPORTER_FK: int, RECEIVER_FK: int, REC_POST_CODE: string, WASTEDESC: string, WASTE_FK: int, GEN_LICNUM: string, VOLUME: int, MEASURE: string, WASTE_TYPE: string, WASTE_ADD: string, CONTAMINENT1_FK: int, CONTAMINENT2_FK: int, CONTAMINENT3_FK: int, CONTAMINENT4_FK: int, TREATMENT_FK: int, ANZSICODE_FK: int, VEH1_REGNO: string, VEH1_LICNO: string, VEH2_REGNO: string, VEH2_LICNO: string, GEN_SIGNEE: string, GEN_DATE: timestamp, TRANS_SIGNEE: string, TRANS_DATE: timestamp, REC_SIGNEE: string, REC_DATE: timestamp, DATECREATED: timestamp, DISCREPANCY: string, APPROVAL_NUMBER: string, TR_TYPE: string, REC_WASTE_FK: int, REC_WASTE_TYPE: string, REC_VOLUME: int, REC_MEASURE: string, DATE_RECEIVED: timestamp, DATE_SCANNED: timestamp, HAS_IMAGE: string, LASTMODIFIED: timestamp]

But trying to read the netCDF4 files fails with No such file or directory.

Code:

import datetime as dt  # Python standard library datetime  module
import numpy as np
from netCDF4 import Dataset  # http://code.google.com/p/netcdf4-python/
import matplotlib.pyplot as plt

rootgrp = Dataset("dbfs:/mnt/eco_dailyRain/2001.daily_rain.nc","r", format="NETCDF4")

Error:

FileNotFoundError: [Errno 2] No such file or directory: b'dbfs:/mnt/eco_dailyRain/2001.daily_rain.nc'

Any clues?

According to the API reference of the netCDF4 module for the class Dataset, as shown in the figure below:

[Figure: API reference for the netCDF4 Dataset class, showing its path/filename parameter]

The value of the path parameter for Dataset should be an ordinary Unix file path, but dbfs:/mnt/eco_dailyRain/2001.daily_rain.nc is a DBFS URI that only Spark APIs understand, which is why you got the error FileNotFoundError: [Errno 2] No such file or directory: b'dbfs:/mnt/eco_dailyRain/2001.daily_rain.nc'.
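To make the distinction visible, here is a quick check with plain Python file APIs; a sketch assuming the mount above, where only the /dbfs/... FUSE path is visible to local libraries such as netCDF4.

import os

spark_uri  = "dbfs:/mnt/eco_dailyRain/2001.daily_rain.nc"    # understood by Spark APIs only
local_path = "/dbfs/mnt/eco_dailyRain/2001.daily_rain.nc"    # DBFS FUSE path for local Python I/O

print(os.path.exists(spark_uri))    # False - local file APIs cannot resolve the dbfs: scheme
print(os.path.exists(local_path))   # True when the mount is in place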

The fix is to replace the path dbfs:/mnt/eco_dailyRain/2001.daily_rain.nc with the equivalent Unix path /dbfs/mnt/eco_dailyRain/2001.daily_rain.nc, as in the code below.

rootgrp = Dataset("/dbfs/mnt/eco_dailyRain/2001.daily_rain.nc","r", format="NETCDF4")
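Once the file opens, you can inspect what it contains. The sketch below only uses the standard netCDF4 API; the variable name "daily_rain" is a guess based on the file names, so check rootgrp.variables for the real names.

print(rootgrp.dimensions.keys())          # dimension names, e.g. time/lat/lon
print(rootgrp.variables.keys())           # all variables stored in the file

rain = rootgrp.variables["daily_rain"]    # hypothetical variable name - verify against the keys above
print(rain.shape)                         # typically (time, lat, lon) for a gridded daily product
rootgrp.close()                           # release the file handle when done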

You can check that the file is visible at that path with the shell command below.

%sh
ls /dbfs/mnt/eco_dailyRain

Of course, you can also list your netCDF4 data files via dbutils.fs.ls('/mnt/eco_dailyRain'), since the directory is mounted.
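Putting it together, a short sketch that walks the mount listing and opens each .nc file through the FUSE path; it assumes only the mount and the dbfs:/ to /dbfs/ path translation described above.

from netCDF4 import Dataset

for info in dbutils.fs.ls("/mnt/eco_dailyRain"):
    if info.name.endswith(".nc"):
        local_path = "/dbfs/mnt/eco_dailyRain/" + info.name
        with Dataset(local_path, "r") as ds:             # netCDF4.Dataset works as a context manager
            print(info.name, list(ds.dimensions.keys()))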
