
How to avoid Spark reading files whose paths don't exist in S3

I have some S3 files such as s3://test-shivi/blah1/blah1.parquet, s3://test-shivi/blah2/blah2.parquet, and s3://test-shivi/blah3/NONE.

Now I want to load all the Parquet files via Spark, such as:

df = spark.read.parquet("s3a:///test-shivi/*.*.parquet", schema=spark_schema)

But as blah3 doesn't have a matching file, I am getting this error:

pyspark.sql.utils.AnalysisException: Path does not exist: s3:

How can I safeguard/skip those dirs that don't have any matching files?

Looks like the problem is that your path/wildcard pattern is wrong. Use this instead:

df = spark.read.parquet("s3a://test-shivi/*/*.parquet", schema=spark_schema)

If blah3 doesn't contain a Parquet file, it simply won't match the pattern, so it won't cause any issue.

But be careful with leading slashes: s3a:/// is wrong, it has to be s3a://{bucket}/.
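The difference between the two patterns can be sketched locally with plain Python glob matching. This is only an illustration: fnmatch is a simplified stand-in for Hadoop's glob matcher (which is what Spark actually uses, and which treats "/" slightly differently), and the key list below just mirrors the layout from the question.

```python
from fnmatch import fnmatch

# Object keys under the bucket, as laid out in the question.
keys = [
    "blah1/blah1.parquet",
    "blah2/blah2.parquet",
    "blah3/NONE",
]

# The original pattern "*.*.parquet" requires two dots in the key,
# so it matches nothing here.
print([k for k in keys if fnmatch(k, "*.*.parquet")])
# → []

# The corrected pattern "*/*.parquet" matches one file per directory
# that actually contains a .parquet file; blah3/NONE is skipped.
print([k for k in keys if fnmatch(k, "*/*.parquet")])
# → ['blah1/blah1.parquet', 'blah2/blah2.parquet']
```

Because a directory with no matching file contributes nothing to the match set rather than producing a missing path, the corrected pattern avoids the AnalysisException entirely.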
