How to stop Spark from reading files whose path doesn't exist in S3
I have some S3 files:
- s3://test-shivi/blah1/blah1.parquet
- s3://test-shivi/blah2/blah2.parquet
- s3://test-shivi/blah3/NONE
Now I want to load all the parquet files via Spark, like this:
df = spark.read.parquet("s3a:///test-shivi/*.*.parquet", schema=spark_schema)
But as blah3 doesn't have a matching file, I am getting this error:
pyspark.sql.utils.AnalysisException: Path does not exist: s3:
How can I safeguard/skip those dirs that don't have any matching files?
Looks like the problem is that your path/wildcard pattern is wrong. Use this instead:
df = spark.read.parquet("s3a://test-shivi/*/*.parquet", schema=spark_schema)
If blah3 doesn't contain a parquet file, it won't match the pattern, so it won't cause any issue.
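The effect of the corrected pattern can be sanity-checked locally. The sketch below uses Python's `pathlib.PurePosixPath.match`, whose `*` does not cross `/` separators (similar in spirit to Hadoop's glob, though the two are not identical); the key names are the ones from the question:

```python
from pathlib import PurePosixPath

# Object keys from the question (bucket prefix stripped).
keys = ["blah1/blah1.parquet", "blah2/blah2.parquet", "blah3/NONE"]

# Which keys does the pattern */*.parquet select?
matches = [k for k in keys if PurePosixPath(k).match("*/*.parquet")]
print(matches)
```

`blah3/NONE` is simply never selected, so the non-matching directory is skipped rather than causing an error.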
But be careful with leading slashes: s3a:/// is wrong; it has to be s3a://{bucket}/.
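If you'd rather not rely on glob matching at all, another safeguard is to enumerate the keys yourself and pass only existing parquet paths to Spark. A minimal sketch, assuming the listing would come from something like boto3's `list_objects_v2` (not shown here; the list below is hard-coded from the question, and `spark`/`spark_schema` are the asker's own objects):

```python
# Hypothetical result of listing the bucket; in a real job this would
# come from e.g. boto3's list_objects_v2 (assumption, not shown).
keys = [
    "blah1/blah1.parquet",
    "blah2/blah2.parquet",
    "blah3/NONE",
]

# Keep only keys that are actually parquet files, as full s3a URIs.
parquet_paths = [
    f"s3a://test-shivi/{k}" for k in keys if k.endswith(".parquet")
]
print(parquet_paths)

# DataFrameReader.parquet accepts multiple paths, so the directory with
# no matching file is simply never passed in (needs a live SparkSession):
# df = spark.read.schema(spark_schema).parquet(*parquet_paths)
```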