
Read a zip file in Databricks from Azure Storage Explorer

I want to read zip files that contain CSV files. I have tried many approaches without success. In my case, the path I need to read from is in Azure Storage Explorer.

For example, when I have to read a csv in databricks I use the following code:

dfDemandaBilletesCmbinad = spark.read.csv("/mnt/data/myCSVfile.csv", header=True)

The Azure Storage path I want to read is "/mnt/data/myZipFile.zip", which contains several CSV files.

Is it possible to read these CSV files from Azure storage via PySpark in Databricks?

I think the only way to do this is with pandas, openpyxl and Python's zipfile library, as there is no similar library for PySpark.

import pandas as pd
import openpyxl, zipfile

# Unzip and extract to a folder. It might be better to unzip in memory with StringIO.
with zipfile.ZipFile('/dbfs/mnt/data/file.zip', 'r') as zip_ref:
    zip_ref.extractall('/dbfs/mnt/data/unzipped')

#read excel
my_excel = openpyxl.load_workbook('/dbfs/mnt/data/unzipped/file.xlsx') 
ws = my_excel['worksheet1']

# create pandas dataframe
df = pd.DataFrame(ws.values)

# create spark dataframe
spark_df = spark.createDataFrame(df)

The drawback is that this runs only on the driver node of the cluster.
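The in-memory idea mentioned in the code comment can be sketched for the CSV case from the question. This is a minimal sketch using only the standard library. The function name read_csvs_from_zip is hypothetical, and the UTF-8 encoding is an assumption about the files. It also still runs only on the driver.

```python
import csv
import io
import zipfile


def read_csvs_from_zip(zip_path, encoding="utf-8"):
    """Return {member_name: (header, rows)} for every .csv member of the archive.

    Nothing is extracted to disk: each member is decoded and parsed in memory.
    """
    tables = {}
    with zipfile.ZipFile(zip_path, "r") as archive:
        for name in archive.namelist():
            if not name.lower().endswith(".csv"):
                continue  # skip directories and non-CSV members
            # archive.read() returns bytes; decode and wrap them in StringIO
            text = archive.read(name).decode(encoding)
            reader = csv.reader(io.StringIO(text, newline=""))
            header = next(reader, [])  # first row is treated as the header
            rows = [tuple(row) for row in reader]
            tables[name] = (header, rows)
    return tables
```

On Databricks, a table returned for the path "/dbfs/mnt/data/myZipFile.zip" could then be passed to spark.createDataFrame(rows, schema=header). Every column arrives as a string, so any casting would have to be done afterwards.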

Please keep in mind that Azure Storage Explorer does not store any data. It is a tool that lets you access your Azure storage account from any device and on any platform. The data is always stored in an Azure storage account.

In your scenario, it appears that your Azure storage account is already mounted to a Databricks DBFS path. Since it is mounted, you can use the spark.read command to access the file directly from the Azure storage account.

Sample:

df = spark.read.text("dbfs:/mymount/my_file.txt")

Reference: https://docs.databricks.com/data/databricks-file-system.html

Regarding the ZIP file, please refer to:

https://docs.microsoft.com/en-us/azure/databricks/_static/notebooks/zip-files-python.html
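One common pattern for the ZIP part is to extract the archive first and then let Spark read the extracted files. The sketch below is not necessarily what the linked notebook does. The helper name extract_csvs and the target folder are assumptions, and the paths reuse the mount from the question. The extraction itself runs on the driver, but the subsequent spark.read.csv call is distributed across the cluster.

```python
import os
import zipfile


def extract_csvs(zip_path, target_dir):
    """Extract only the .csv members of zip_path into target_dir.

    Returns the list of paths of the extracted files.
    """
    os.makedirs(target_dir, exist_ok=True)
    extracted = []
    with zipfile.ZipFile(zip_path, "r") as archive:
        for name in archive.namelist():
            if name.lower().endswith(".csv"):
                # extract() returns the path of the file it created
                extracted.append(archive.extract(name, target_dir))
    return extracted


# On Databricks (assumed paths; local file APIs use the /dbfs prefix,
# while Spark uses the dbfs:/ scheme for the same location):
# extract_csvs("/dbfs/mnt/data/myZipFile.zip", "/dbfs/mnt/data/unzipped")
# df = spark.read.csv("dbfs:/mnt/data/unzipped/*.csv", header=True)
```

The glob in the commented Spark call only matches files at the top level of the target folder, so CSV files nested in subfolders of the archive would need a different path pattern.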
