
How to prevent Spark from reading files whose paths don't exist in S3

I have some S3 files, such as s3://test-shivi/blah1/blah1.parquet, s3://test-shivi/blah2/blah2.parquet, and s3://test-shivi/blah3/NONE.

Now I want to load all the parquet files via Spark, like so:

df = spark.read.parquet("s3a:///test-shivi/*.*.parquet", schema=spark_schema)

But since blah3 doesn't have a matching file, I get this error:

pyspark.sql.utils.AnalysisException: Path does not exist: s3:

How can I safeguard/skip those dirs that don't have any matching files?

It looks like the problem is that your path/wildcard pattern is wrong. Use this instead:

df = spark.read.parquet("s3a://test-shivi/*/*.parquet", schema=spark_schema)

If blah3 doesn't contain a parquet file, it won't match the pattern. That won't cause any issue.

But be careful with the leading slashes: s3a:/// is wrong; it has to be s3a://{bucket}/.
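If you ever do need to skip prefixes explicitly rather than rely on the glob, you can list the bucket first and only pass directories that actually contain parquet files. This is only a minimal sketch: it assumes boto3 is available, reuses the bucket and prefix names from the question, and has_parquet is a hypothetical helper, not part of any Spark API.

import boto3

s3 = boto3.client("s3")
bucket = "test-shivi"
prefixes = ["blah1/", "blah2/", "blah3/"]

def has_parquet(bucket, prefix):
    # True if the prefix contains at least one .parquet object
    resp = s3.list_objects_v2(Bucket=bucket, Prefix=prefix)
    return any(obj["Key"].endswith(".parquet") for obj in resp.get("Contents", []))

# Keep only prefixes that actually contain parquet files
paths = ["s3a://{}/{}".format(bucket, p) for p in prefixes if has_parquet(bucket, p)]

# spark.read.parquet accepts multiple paths; apply the schema via the reader
df = spark.read.schema(spark_schema).parquet(*paths)

For the layout in the question, though, the */*.parquet pattern alone is enough; the pre-filtering is only useful when the filter can't be expressed as a glob.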
