Read .nc files from Azure Datalake Gen2 in Azure Databricks

Trying to read .nc (netCDF4) files in Azure Databricks.

I have never worked with .nc files before.

  1. All the required .nc files are in Azure Datalake Gen2.
  2. Mounted the above files into Databricks at "/mnt/eco_dailyRain" (a sketch of a typical mount call follows the listing below).
  3. Can list the contents of the mount using dbutils.fs.ls("/mnt/eco_dailyRain"). OUTPUT:

     Out[76]: [FileInfo(path='dbfs:/mnt/eco_dailyRain/2000.daily_rain.nc', name='2000.daily_rain.nc', size=429390127), FileInfo(path='dbfs:/mnt/eco_dailyRain/2001.daily_rain.nc', name='2001.daily_rain.nc', size=428217143), FileInfo(path='dbfs:/mnt/eco_dailyRain/2002.daily_rain.nc', name='2002.daily_rain.nc', size=428218181), FileInfo(path='dbfs:/mnt/eco_dailyRain/2003.daily_rain.nc', name='2003.daily_rain.nc', size=428217139), FileInfo(path='dbfs:/mnt/eco_dailyRain/2004.daily_rain.nc', name='2004.daily_rain.nc', size=429390143), FileInfo(path='dbfs:/mnt/eco_dailyRain/2005.daily_rain.nc', name='2005.daily_rain.nc', size=428217137), FileInfo(path='dbfs:/mnt/eco_dailyRain/2006.daily_rain.nc', name='2006.daily_rain.nc', size=428217127), FileInfo(path='dbfs:/mnt/eco_dailyRain/2007.daily_rain.nc', name='2007.daily_rain.nc', size=428217143), FileInfo(path='dbfs:/mnt/eco_dailyRain/2008.daily_rain.nc', name='2008.daily_rain.nc', size=429390137), FileInfo(path='dbfs:/mnt/eco_dailyRain/2009.daily_rain.nc', name='2009.daily_rain.nc', size=428217127), FileInfo(path='dbfs:/mnt/eco_dailyRain/2010.daily_rain.nc', name='2010.daily_rain.nc', size=428217134), FileInfo(path='dbfs:/mnt/eco_dailyRain/2011.daily_rain.nc', name='2011.daily_rain.nc', size=428218181), FileInfo(path='dbfs:/mnt/eco_dailyRain/2012.daily_rain.nc', name='2012.daily_rain.nc', size=429390127), FileInfo(path='dbfs:/mnt/eco_dailyRain/2013.daily_rain.nc', name='2013.daily_rain.nc', size=428217143), FileInfo(path='dbfs:/mnt/eco_dailyRain/2014.daily_rain.nc', name='2014.daily_rain.nc', size=428218104), FileInfo(path='dbfs:/mnt/eco_dailyRain/2015.daily_rain.nc', name='2015.daily_rain.nc', size=428217134), FileInfo(path='dbfs:/mnt/eco_dailyRain/2016.daily_rain.nc', name='2016.daily_rain.nc', size=429390127), FileInfo(path='dbfs:/mnt/eco_dailyRain/2017.daily_rain.nc', name='2017.daily_rain.nc', size=428217223), FileInfo(path='dbfs:/mnt/eco_dailyRain/2018.daily_rain.nc', name='2018.daily_rain.nc', size=418143765), FileInfo(path='dbfs:/mnt/eco_dailyRain/2019.daily_rain.nc', name='2019.daily_rain.nc', size=370034113), FileInfo(path='dbfs:/mnt/eco_dailyRain/Consignments.parquet', name='Consignments.parquet', size=237709917), FileInfo(path='dbfs:/mnt/eco_dailyRain/test.nc', name='test.nc', size=428217137)]
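For context, such a mount is usually created with dbutils.fs.mount. The sketch below assumes an OAuth service principal; the storage account, container, tenant ID, and secret-scope names are placeholders, not values from this setup.

# Minimal sketch of mounting an ADLS Gen2 container into DBFS (all <...> values are placeholders).
configs = {
  "fs.azure.account.auth.type": "OAuth",
  "fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
  "fs.azure.account.oauth2.client.id": "<application-id>",
  "fs.azure.account.oauth2.client.secret": dbutils.secrets.get(scope="<scope>", key="<key>"),
  "fs.azure.account.oauth2.client.endpoint": "https://login.microsoftonline.com/<tenant-id>/oauth2/token"
}

dbutils.fs.mount(
  source = "abfss://<container>@<storage-account>.dfs.core.windows.net/",
  mount_point = "/mnt/eco_dailyRain",
  extra_configs = configs)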

Just to test whether I can read from the mount:

spark.read.parquet('dbfs:/mnt/eco_dailyRain/Consignments.parquet')

This confirms that the Parquet file can be read.

Output:

Out[83]: DataFrame[CONSIGNMENT_PK: int, CERTIFICATE_NO: string, ACTOR_NAME: string, GENERATOR_FK: int, TRANSPORTER_FK: int, RECEIVER_FK: int, REC_POST_CODE: string, WASTEDESC: string, WASTE_FK: int, GEN_LICNUM: string, VOLUME: int, MEASURE: string, WASTE_TYPE: string, WASTE_ADD: string, CONTAMINENT1_FK: int, CONTAMINENT2_FK: int, CONTAMINENT3_FK: int, CONTAMINENT4_FK: int, TREATMENT_FK: int, ANZSICODE_FK: int, VEH1_REGNO: string, VEH1_LICNO: string, VEH2_REGNO: string, VEH2_LICNO: string, GEN_SIGNEE: string, GEN_DATE: timestamp, TRANS_SIGNEE: string, TRANS_DATE: timestamp, REC_SIGNEE: string, REC_DATE: timestamp, DATECREATED: timestamp, DISCREPANCY: string, APPROVAL_NUMBER: string, TR_TYPE: string, REC_WASTE_FK: int, REC_WASTE_TYPE: string, REC_VOLUME: int, REC_MEASURE: string, DATE_RECEIVED: timestamp, DATE_SCANNED: timestamp, HAS_IMAGE: string, LASTMODIFIED: timestamp]

But trying to read the netCDF4 files fails with No such file or directory.

Code:

import datetime as dt  # Python standard library datetime  module
import numpy as np
from netCDF4 import Dataset  # http://code.google.com/p/netcdf4-python/
import matplotlib.pyplot as plt

rootgrp = Dataset("dbfs:/mnt/eco_dailyRain/2001.daily_rain.nc","r", format="NETCDF4")

Error:

FileNotFoundError: [Errno 2] No such file or directory: b'dbfs:/mnt/eco_dailyRain/2001.daily_rain.nc'

Any clues?

According to the API reference of the netCDF4 module for the class Dataset, as shown in the figure below:

[Figure: API reference for the netCDF4 Dataset class, showing its path/filename parameter]

The value of the path parameter for Dataset should be an ordinary Unix file path, but dbfs:/mnt/eco_dailyRain/2001.daily_rain.nc is a DBFS URI that only Spark APIs understand, which is why you got the error FileNotFoundError: [Errno 2] No such file or directory: b'dbfs:/mnt/eco_dailyRain/2001.daily_rain.nc'.
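To make the distinction visible, here is a quick check with plain Python file APIs; a sketch assuming the mount above, where only the /dbfs/... FUSE path is visible to local libraries such as netCDF4.

import os

spark_uri  = "dbfs:/mnt/eco_dailyRain/2001.daily_rain.nc"    # understood by Spark APIs only
local_path = "/dbfs/mnt/eco_dailyRain/2001.daily_rain.nc"    # DBFS FUSE path for local Python I/O

print(os.path.exists(spark_uri))    # False - local file APIs cannot resolve the dbfs: scheme
print(os.path.exists(local_path))   # True when the mount is in place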

The fix is to replace the path dbfs:/mnt/eco_dailyRain/2001.daily_rain.nc with the equivalent Unix path /dbfs/mnt/eco_dailyRain/2001.daily_rain.nc, as in the code below.

rootgrp = Dataset("/dbfs/mnt/eco_dailyRain/2001.daily_rain.nc","r", format="NETCDF4")
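Once the file opens, you can inspect what it contains. The sketch below only uses the standard netCDF4 API; the variable name "daily_rain" is a guess based on the file names, so check rootgrp.variables for the real names.

print(rootgrp.dimensions.keys())          # dimension names, e.g. time/lat/lon
print(rootgrp.variables.keys())           # all variables stored in the file

rain = rootgrp.variables["daily_rain"]    # hypothetical variable name - verify against the keys above
print(rain.shape)                         # typically (time, lat, lon) for a gridded daily product
rootgrp.close()                           # release the file handle when done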

You can check that the file is visible at that path with the shell command below.

%sh
ls /dbfs/mnt/eco_dailyRain

Of course, you can also list your netCDF4 data files via dbutils.fs.ls('/mnt/eco_dailyRain'), since the directory is mounted.
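Putting it together, a short sketch that walks the mount listing and opens each .nc file through the FUSE path; it assumes only the mount and the dbfs:/ to /dbfs/ path translation described above.

from netCDF4 import Dataset

for info in dbutils.fs.ls("/mnt/eco_dailyRain"):
    if info.name.endswith(".nc"):
        local_path = "/dbfs/mnt/eco_dailyRain/" + info.name
        with Dataset(local_path, "r") as ds:             # netCDF4.Dataset works as a context manager
            print(info.name, list(ds.dimensions.keys()))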
