
How to avoid Spark reading files whose paths don't exist in S3

I have some S3 files such as s3://test-shivi/blah1/blah1.parquet, s3://test-shivi/blah2/blah2.parquet, and s3://test-shivi/blah3/NONE.

Now I want to load all the Parquet files via Spark, such as:

df = spark.read.parquet("s3a:///test-shivi/*.*.parquet", schema=spark_schema)

But as blah3 doesn't have a matching file, I am getting this error:

pyspark.sql.utils.AnalysisException: Path does not exist: s3:

How can I safeguard/skip those dirs that don't have any matching files?

Looks like the problem is that your path/wildcard pattern is wrong. Use this instead:

df = spark.read.parquet("s3a://test-shivi/*/*.parquet", schema=spark_schema)

If blah3 doesn't contain a Parquet file, it simply won't match the pattern, so it won't cause any issue.

But be careful with leading slashes: s3a:/// is wrong, it has to be s3a://{bucket}/.
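The difference between the two patterns can be sketched locally with plain Python glob matching. This is only an illustration: fnmatch is a simplified stand-in for Hadoop's glob matcher (which is what Spark actually uses, and which treats "/" slightly differently), and the key list below just mirrors the layout from the question.

```python
from fnmatch import fnmatch

# Object keys under the bucket, as laid out in the question.
keys = [
    "blah1/blah1.parquet",
    "blah2/blah2.parquet",
    "blah3/NONE",
]

# The original pattern "*.*.parquet" requires two dots in the key,
# so it matches nothing here.
print([k for k in keys if fnmatch(k, "*.*.parquet")])
# → []

# The corrected pattern "*/*.parquet" matches one file per directory
# that actually contains a .parquet file; blah3/NONE is skipped.
print([k for k in keys if fnmatch(k, "*/*.parquet")])
# → ['blah1/blah1.parquet', 'blah2/blah2.parquet']
```

Because a directory with no matching file contributes nothing to the match set rather than producing a missing path, the corrected pattern avoids the AnalysisException entirely.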
