
How to read a CSV file from a "File Share" in an ADLS Gen2 Data Lake inside Databricks using PySpark

I have an ADLS Gen2 Data Lake with "Blob Containers" and "File Shares". I have mounted the blob containers in my Databricks workspace, so I can read everything in them from my Databricks notebooks.

I also have some files in the "File Share", but I am not able to read these files into a dataframe through Databricks using PySpark.

I have created a Shared Access Signature for the File Share, and I have the URL for one of the files inside the share as well. That URL works fine through Postman; I can download the file using it.

A sample URL is shown below:

https://somedatalakename.file.core.windows.net/file_share_name/Data_20200330_1030.csv?sv=yyyy-mm-dd&si=somename&sr=s&sig=somerandomsignature%3D

How do I read this same CSV, which is inside the File Share, into a dataframe through Databricks using PySpark?

I also tried:

from pyspark import SparkFiles

# uri is the SAS URL of the file in the File Share; date_str is e.g. "20200330"
spark.sparkContext.addFile(uri)
call_df = spark.read.format("csv").option("header", "true").load("file://" + SparkFiles.get("Data_" + date_str + "_1030.csv"))

And I get the error below:

org.apache.spark.sql.AnalysisException: Path does not exist: file:/local_disk0/spark-ce42ed1b-5d82-4559-9000-d1bf3621539e/userFiles-eaf0fd36-68aa-409e-8610-a7909635b006/Data_20200330_1030.csv

Please give me some pointers on how to solve this problem. Thanks.

The problem is with your load syntax. The file: scheme does not work in Databricks, so you need to replace it with dbfs:, i.e. the Databricks File System (DBFS). Command to load the file:

spark.read.format("csv").option("header","true").load(f"dbfs:/path/to/your/directory/FileName.csv")
