
Read multiple files from Databricks DBFS

I've recently started working with Databricks Python notebooks and can't figure out how to read multiple .csv files from DBFS the way I did in Jupyter notebooks.

I've tried:

import glob
import pandas as pd

path = r'dbfs:/FileStore/shared_uploads/path/'
all_files = glob.glob(path + "/*.csv")

li = []
for filename in all_files:
    df = pd.read_csv(filename, index_col=None, header=0, low_memory=False)
    li.append(df)

data = pd.concat(li, axis=0, ignore_index=True)

This code worked perfectly in Jupyter notebooks, but in Databricks I get this error: ValueError: No objects to concatenate

I can read a single file in this path using df = pd.read_csv('dbfs_path/filename.csv')

Thanks!

You need to change the path to r'/dbfs/FileStore/shared_uploads/path/'

The glob function works against the local filesystem attached to the driver and has no notion of what the dbfs: scheme means; DBFS is mounted on the driver under /dbfs.
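
For example, the question's loop works unchanged once the path points at the /dbfs/ mount; a minimal sketch reusing the question's placeholder directory:

import glob
import pandas as pd

# DBFS is mounted on the driver's local filesystem under /dbfs,
# so plain Python file APIs such as glob can see the files there
path = r'/dbfs/FileStore/shared_uploads/path/'
all_files = glob.glob(path + "*.csv")

li = []
for filename in all_files:
    df = pd.read_csv(filename, index_col=None, header=0, low_memory=False)
    li.append(df)

data = pd.concat(li, axis=0, ignore_index=True)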

Also, since you are combining a lot of CSV files, why not read them directly with Spark:

path = r'dbfs:/FileStore/shared_uploads/path/*.csv' 
df = spark.read.csv(path)
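
If you still want a single pandas DataFrame at the end, you can convert the Spark DataFrame on the driver; a minimal sketch, where header and inferSchema are assumptions about your files:

# `spark` is the SparkSession that Databricks notebooks provide by default
path = r'dbfs:/FileStore/shared_uploads/path/*.csv'
sdf = spark.read.csv(path, header=True, inferSchema=True)  # header/inferSchema assumed to match the files
data = sdf.toPandas()  # collects all rows to the driver as a pandas DataFrame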

When reading a DBFS location, you can also list the files with dbutils and process them one by one, like this:

files = dbutils.fs.ls('/FileStore/shared_uploads/path/')
li = []
for fi in files:
    if fi.path.endswith('.csv'):
        # dbutils returns dbfs:/ URIs; pandas needs the local /dbfs/ mount path
        df = pd.read_csv(fi.path.replace('dbfs:/', '/dbfs/'), index_col=None, header=0)
        li.append(df)
data = pd.concat(li, axis=0, ignore_index=True)
