
Iterate through files in a DBFS location from PySpark in Databricks Community Edition

I want to iterate through the files available in a DBFS location in Databricks, but it throws an error: 'org.apache.spark.sql.AnalysisException: Path does not exist:'. Here's the code I tried:

import os
from pyspark.sql.types import *
fileDirectory = '/dbfs/FileStore/tables/'
for fname in os.listdir(fileDirectory):
    df_app = sqlContext.read.format("csv").\
        option("header", "true"). \`enter code here`
        load(fileDirectory + fname)

And the error is

org.apache.spark.sql.AnalysisException: Path does not exist: dbfs:/dbfs/FileStore/tables/Dept_data.csv;

Can you please help with this?

Thanks in advance.

When reading files in Databricks with the DataFrameReader (i.e. spark.read...), paths are resolved directly against DBFS, where the FileStore tables directory is, in fact, dbfs:/FileStore/tables/. The Python os library, on the other hand, sees DBFS through the local /dbfs mount point (which is why you can access it as /dbfs/FileStore/tables/). So you need the local path for os.listdir and the DBFS path for the Spark reader; something like this should work fine:

import os
from pyspark.sql.types import *
fileDirectory = '/dbfs/FileStore/tables/'  # local filesystem path, for os.listdir
dir = '/FileStore/tables/'                 # DBFS path, for the Spark reader
for fname in os.listdir(fileDirectory):
    df_app = sqlContext.read.format("csv").option("header", "true").load(dir + fname)

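As a side note (not from the original answer), the Spark reader also accepts glob patterns, so if the goal is simply to load every file in the directory into one DataFrame, the loop can be replaced by a single call:

# A minimal sketch, assuming every file in the directory is a CSV with
# the same schema; Spark expands the * glob itself, so neither
# os.listdir nor the local /dbfs path is needed.
df_all = sqlContext.read.format("csv") \
    .option("header", "true") \
    .load("/FileStore/tables/*.csv")
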
In addition, you can check out the dbutils commands (https://docs.databricks.com/dev-tools/databricks-utils.html#dbutilsfsls-command), which let you manipulate DBFS directly (without dealing with its inner implementation).
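For example, a minimal sketch using dbutils.fs.ls instead of os.listdir (assuming this runs in a Databricks notebook, where dbutils and sqlContext are already defined):

# dbutils.fs.ls returns FileInfo objects; each one carries a .path
# that is already a dbfs:/ URI the Spark reader understands.
for f in dbutils.fs.ls("/FileStore/tables/"):
    if f.name.endswith(".csv"):  # skip anything that is not a CSV
        df_app = sqlContext.read.format("csv") \
            .option("header", "true") \
            .load(f.path)

Hope this helps.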
