
PySpark on Databricks: Reading a CSV file copied from the Azure Blob Storage results in java.io.FileNotFoundException

I am running Azure Databricks 4.3 (includes Apache Spark 2.3.1, Scala 2.11).

I copied a CSV file from Azure Blob Storage onto the Databricks cluster's local disk using dbutils.fs.cp, adding file: in front of the absolute local_path:

copy_to = "file:" + local_path
dbutils.fs.cp(blob_storage_path, copy_to)

When I then try to read the file using the same path with file: added in front:

csv_spark_df = sqlContext.read.format('csv').options(header='true', inferSchema='true').load(copy_to)

I am getting an error message indicating that the given path does not exist:

java.io.FileNotFoundException: File file:/<local_path>

When I mount the Azure Blob Storage container, as described here, I can read the file correctly with Spark using the same snippet above, pointing it at the absolute local_path of the file inside the mounted directory:

https://docs.databricks.com/spark/latest/data-sources/azure/azure-storage.html#mount-azure-blob-storage-containers-with-dbfs
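
The mount call from those docs looks roughly like this (the container, account, mount point, and secret scope names below are placeholders, not my actual values):

dbutils.fs.mount(
    source="wasbs://<container-name>@<storage-account-name>.blob.core.windows.net",
    mount_point="/mnt/<mount-name>",
    extra_configs={"fs.azure.account.key.<storage-account-name>.blob.core.windows.net": dbutils.secrets.get(scope="<scope-name>", key="<key-name>")})

# read the same file through the mount point instead of a file: path
csv_spark_df = sqlContext.read.format('csv').options(header='true', inferSchema='true').load("/mnt/<mount-name>/<path-to-file>.csv")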

Is it at all possible to read the CSV file that was copied from Azure Blob Storage this way, or is mounting the Azure Blob Storage container the preferred solution anyway?

I'm not certain what file: will map to.

I would have expected the path to be a DBFS path:

copy_to = "/path/file.csv"

This will be assumed to be a DBFS path.

You can always do:

dbutils.fs.ls("/path")

to verify that the file was copied.
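
Putting that together, a minimal sketch of the flow with a plain DBFS path would be (the /tmp destination here is only an illustrative placeholder):

copy_to = "/tmp/file.csv"  # plain DBFS path, no file: prefix
dbutils.fs.cp(blob_storage_path, copy_to)
dbutils.fs.ls("/tmp")  # confirm the file landed in DBFS
csv_spark_df = sqlContext.read.format('csv').options(header='true', inferSchema='true').load(copy_to)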

Note, though, that you do not need to copy the file to DBFS to load it into a dataframe; you can read directly from the blob storage account, and that would be the normal approach. Is there a reason you want to copy it locally?
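
Reading directly looks roughly like this (a sketch assuming access with a storage account key; the account, container, and path names are placeholders):

# placeholders: <storage-account-name>, <storage-account-access-key>, <container-name>, <path-to-file>
spark.conf.set("fs.azure.account.key.<storage-account-name>.blob.core.windows.net", "<storage-account-access-key>")
csv_spark_df = sqlContext.read.format('csv').options(header='true', inferSchema='true').load("wasbs://<container-name>@<storage-account-name>.blob.core.windows.net/<path-to-file>.csv")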

