
Read multiple files from Databricks DBFS

I've recently started working with Databricks Python notebooks and can't figure out how to read multiple .csv files from DBFS the way I did in Jupyter notebooks.

I've tried:

import glob
import pandas as pd

path = r'dbfs:/FileStore/shared_uploads/path/'
all_files = glob.glob(path + "/*.csv")

li = []
for filename in all_files:
    df = pd.read_csv(filename, index_col=None, header=0, low_memory=False)
    li.append(df)

data = pd.concat(li, axis=0, ignore_index=True)

This code worked perfectly in Jupyter notebooks, but in Databricks I get this error: ValueError: No objects to concatenate

I can read a single file in this path using df = pd.read_csv('dbfs_path/filename.csv')

Thanks!

You need to change the path to r'/dbfs/FileStore/shared_uploads/path/'

The glob function works against the local filesystem attached to the driver and has no notion of what the dbfs: scheme means; DBFS is mounted on the driver under /dbfs.
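
For example, the question's loop works unchanged once the path points at the /dbfs/ mount; a minimal sketch reusing the question's placeholder directory:

import glob
import pandas as pd

# DBFS is mounted on the driver's local filesystem under /dbfs,
# so plain Python file APIs such as glob can see the files there
path = r'/dbfs/FileStore/shared_uploads/path/'
all_files = glob.glob(path + "*.csv")

li = []
for filename in all_files:
    df = pd.read_csv(filename, index_col=None, header=0, low_memory=False)
    li.append(df)

data = pd.concat(li, axis=0, ignore_index=True)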

Also, since you are combining a lot of CSV files, why not read them directly with Spark:

path = r'dbfs:/FileStore/shared_uploads/path/*.csv' 
df = spark.read.csv(path)
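
If you still want a single pandas DataFrame at the end, you can convert the Spark DataFrame on the driver; a minimal sketch, where header and inferSchema are assumptions about your files:

# `spark` is the SparkSession that Databricks notebooks provide by default
path = r'dbfs:/FileStore/shared_uploads/path/*.csv'
sdf = spark.read.csv(path, header=True, inferSchema=True)  # header/inferSchema assumed to match the files
data = sdf.toPandas()  # collects all rows to the driver as a pandas DataFrame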

When reading a DBFS location, you can also list the files with dbutils and process them one by one, like this:

files = dbutils.fs.ls('/FileStore/shared_uploads/path/')
li = []
for fi in files:
    if fi.path.endswith('.csv'):
        # dbutils returns dbfs:/ URIs; pandas needs the local /dbfs/ mount path
        df = pd.read_csv(fi.path.replace('dbfs:/', '/dbfs/'), index_col=None, header=0)
        li.append(df)
data = pd.concat(li, axis=0, ignore_index=True)
