
How to prevent Spark from reading files whose paths don't exist in S3

I have some S3 files, such as s3://test-shivi/blah1/blah1.parquet, s3://test-shivi/blah2/blah2.parquet, and s3://test-shivi/blah3/NONE.

Now I want to load all the parquet files via Spark, like so:

df = spark.read.parquet("s3a:///test-shivi/*.*.parquet", schema=spark_schema)

But since blah3 doesn't have a matching file, I get this error:

pyspark.sql.utils.AnalysisException: Path does not exist: s3:

How can I safeguard/skip those dirs that don't have any matching files?

It looks like the problem is that your path/wildcard pattern is wrong. Use this instead:

df = spark.read.parquet("s3a://test-shivi/*/*.parquet", schema=spark_schema)

If blah3 doesn't contain a parquet file, it won't match the pattern. That won't cause any issue.

But be careful with the leading slashes: s3a:/// is wrong; it has to be s3a://{bucket}/.
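If you ever do need to skip prefixes explicitly rather than rely on the glob, you can list the bucket first and only pass directories that actually contain parquet files. This is only a minimal sketch: it assumes boto3 is available, reuses the bucket and prefix names from the question, and has_parquet is a hypothetical helper, not part of any Spark API.

import boto3

s3 = boto3.client("s3")
bucket = "test-shivi"
prefixes = ["blah1/", "blah2/", "blah3/"]

def has_parquet(bucket, prefix):
    # True if the prefix contains at least one .parquet object
    resp = s3.list_objects_v2(Bucket=bucket, Prefix=prefix)
    return any(obj["Key"].endswith(".parquet") for obj in resp.get("Contents", []))

# Keep only prefixes that actually contain parquet files
paths = ["s3a://{}/{}".format(bucket, p) for p in prefixes if has_parquet(bucket, p)]

# spark.read.parquet accepts multiple paths; apply the schema via the reader
df = spark.read.schema(spark_schema).parquet(*paths)

For the layout in the question, though, the */*.parquet pattern alone is enough; the pre-filtering is only useful when the filter can't be expressed as a glob.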
