
PySpark on Databricks: Reading a CSV file copied from the Azure Blob Storage results in java.io.FileNotFoundException

I am running Azure Databricks 4.3 (includes Apache Spark 2.3.1, Scala 2.11).

I copied a CSV file from Azure Blob Storage onto the Databricks cluster's local disk using dbutils.fs.cp, adding file: in front of the absolute local_path:

copy_to = "file:" + local_path
dbutils.fs.cp(blob_storage_path, copy_to)

When I then try to read the file using the same path with file: added in front:

csv_spark_df = sqlContext.read.format('csv').options(header='true', inferSchema='true').load(copy_to)

I am getting an error message indicating that the given path does not exist:

java.io.FileNotFoundException: File file:/<local_path>

When I mount the Azure Blob Storage container, as described here, I can read the file correctly with Spark using the same snippet above, pointing it at the absolute local_path of the file inside the mounted directory:

https://docs.databricks.com/spark/latest/data-sources/azure/azure-storage.html#mount-azure-blob-storage-containers-with-dbfs
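
The mount call from those docs looks roughly like this (the container, account, mount point, and secret scope names below are placeholders, not my actual values):

dbutils.fs.mount(
    source="wasbs://<container-name>@<storage-account-name>.blob.core.windows.net",
    mount_point="/mnt/<mount-name>",
    extra_configs={"fs.azure.account.key.<storage-account-name>.blob.core.windows.net": dbutils.secrets.get(scope="<scope-name>", key="<key-name>")})

# read the same file through the mount point instead of a file: path
csv_spark_df = sqlContext.read.format('csv').options(header='true', inferSchema='true').load("/mnt/<mount-name>/<path-to-file>.csv")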

Is it at all possible to read the CSV file that was copied from Azure Blob Storage this way, or is mounting the Azure Blob Storage container the preferred solution anyway?

I'm not certain what file: will map to.

I would have expected the path to be a DBFS path:

copy_to = "/path/file.csv"

This will be assumed to be a DBFS path.

You can always do:

dbutils.fs.ls("/path")

to verify that the file was copied.
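
Putting that together, a minimal sketch of the flow with a plain DBFS path would be (the /tmp destination here is only an illustrative placeholder):

copy_to = "/tmp/file.csv"  # plain DBFS path, no file: prefix
dbutils.fs.cp(blob_storage_path, copy_to)
dbutils.fs.ls("/tmp")  # confirm the file landed in DBFS
csv_spark_df = sqlContext.read.format('csv').options(header='true', inferSchema='true').load(copy_to)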

Note, though, that you do not need to copy the file to DBFS to load it into a dataframe; you can read directly from the blob storage account, and that would be the normal approach. Is there a reason you want to copy it locally?
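
Reading directly looks roughly like this (a sketch assuming access with a storage account key; the account, container, and path names are placeholders):

# placeholders: <storage-account-name>, <storage-account-access-key>, <container-name>, <path-to-file>
spark.conf.set("fs.azure.account.key.<storage-account-name>.blob.core.windows.net", "<storage-account-access-key>")
csv_spark_df = sqlContext.read.format('csv').options(header='true', inferSchema='true').load("wasbs://<container-name>@<storage-account-name>.blob.core.windows.net/<path-to-file>.csv")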

