
Iterate through files in a DBFS location from PySpark in Databricks Community Edition

I want to iterate through the files available in a DBFS location in Databricks, but it throws an error: 'org.apache.spark.sql.AnalysisException: Path does not exist:'. Here's the code I tried:

import os
from pyspark.sql.types import *
fileDirectory = '/dbfs/FileStore/tables/'
for fname in os.listdir(fileDirectory):
    df_app = sqlContext.read.format("csv").\
        option("header", "true"). \`enter code here`
        load(fileDirectory + fname)

And the error is

org.apache.spark.sql.AnalysisException: Path does not exist: dbfs:/dbfs/FileStore/tables/Dept_data.csv;

Can you please help with this?

Thanks in advance.

When reading files in Databricks with the DataFrameReader (i.e. spark.read...), paths are resolved directly against DBFS, where the FileStore tables directory is, in fact, dbfs:/FileStore/tables/. The Python os library, on the other hand, sees DBFS through the local /dbfs mount point (which is why you can access it as /dbfs/FileStore/tables/). So you need the local path for os.listdir and the DBFS path for the Spark reader; something like this should work fine:

import os
from pyspark.sql.types import *
fileDirectory = '/dbfs/FileStore/tables/'  # local filesystem path, for os.listdir
dir = '/FileStore/tables/'                 # DBFS path, for the Spark reader
for fname in os.listdir(fileDirectory):
    df_app = sqlContext.read.format("csv").option("header", "true").load(dir + fname)

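As a side note (not from the original answer), the Spark reader also accepts glob patterns, so if the goal is simply to load every file in the directory into one DataFrame, the loop can be replaced by a single call:

# A minimal sketch, assuming every file in the directory is a CSV with
# the same schema; Spark expands the * glob itself, so neither
# os.listdir nor the local /dbfs path is needed.
df_all = sqlContext.read.format("csv") \
    .option("header", "true") \
    .load("/FileStore/tables/*.csv")
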
In addition, you can check out the dbutils commands (https://docs.databricks.com/dev-tools/databricks-utils.html#dbutilsfsls-command), which let you manipulate DBFS directly (without dealing with its inner implementation).
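For example, a minimal sketch using dbutils.fs.ls instead of os.listdir (assuming this runs in a Databricks notebook, where dbutils and sqlContext are already defined):

# dbutils.fs.ls returns FileInfo objects; each one carries a .path
# that is already a dbfs:/ URI the Spark reader understands.
for f in dbutils.fs.ls("/FileStore/tables/"):
    if f.name.endswith(".csv"):  # skip anything that is not a CSV
        df_app = sqlContext.read.format("csv") \
            .option("header", "true") \
            .load(f.path)

Hope this helps.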
