Loading prefixed parquet files from S3 - Suspicious paths
I have a set of prefixed parquet files (prefixed per the S3 performance recommendations) that I want to load in Spark (on Amazon EMR 5.11.1), but:
val df = spark.read.parquet("s3://bucket/????/analytics")
java.lang.AssertionError: assertion failed: Conflicting directory structures detected. Suspicious paths:
s3://bucket/4a73/analytics
s3://bucket/8163/analytics
If provided paths are partition directories, please set "basePath" in the options of the data source to specify the root directory of the table. If there are multiple root directories, please load them separately and then union them.
at scala.Predef$.assert(Predef.scala:170)
at org.apache.spark.sql.execution.datasources.PartitioningUtils$.parsePartitions(PartitioningUtils.scala:132)
at org.apache.spark.sql.execution.datasources.PartitioningUtils$.parsePartitions(PartitioningUtils.scala:97)
at org.apache.spark.sql.execution.datasources.PartitioningAwareFileIndex.inferPartitioning(PartitioningAwareFileIndex.scala:153)
at org.apache.spark.sql.execution.datasources.InMemoryFileIndex.partitionSpec(InMemoryFileIndex.scala:70)
at org.apache.spark.sql.execution.datasources.PartitioningAwareFileIndex.partitionSchema(PartitioningAwareFileIndex.scala:50)
at org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:134)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:353)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178)
at org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:559)
at org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:543)
... 48 elided
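
The error message itself points at two possible workarounds. Below is a minimal, untested sketch of both, assuming the prefixed directories all contain parquet data with the same schema; the basePath value and the prefix list are placeholders for your own layout.

// Workaround 1 (sketch): declare the bucket root as basePath so Spark does not
// try to infer partition columns from the random prefix directories.
val dfWithBase = spark.read
  .option("basePath", "s3://bucket/")
  .parquet("s3://bucket/????/analytics")

// Workaround 2 (sketch): load each prefixed root separately and union them.
val prefixes = Seq("4a73", "8163")  // hypothetical: enumerate your real prefixes
val dfUnion = prefixes
  .map(p => spark.read.parquet(s"s3://bucket/$p/analytics"))
  .reduce(_ union _)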
You can use s3a instead of s3. This may work for you.
1. You also need the hadoop-aws 2.7.1 JAR on the classpath. This JAR contains the class org.apache.hadoop.fs.s3a.S3AFileSystem.
2. In spark.properties you can set the following (a sketch of the full setup follows below):
spark.hadoop.fs.s3a.access.key=ACCESSKEY
spark.hadoop.fs.s3a.secret.key=SECRETKEY
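
For reference, a minimal sketch of the same read going through s3a, with the credentials set programmatically instead of in spark.properties. ACCESSKEY, SECRETKEY and the bucket are placeholders, and this assumes the hadoop-aws 2.7.1 JAR is already on the classpath (e.g. via --packages org.apache.hadoop:hadoop-aws:2.7.1).

import org.apache.spark.sql.SparkSession

// Build a session with the s3a credentials; spark.hadoop.* settings are
// forwarded to the underlying Hadoop configuration.
val spark = SparkSession.builder()
  .appName("read-prefixed-parquet")
  .config("spark.hadoop.fs.s3a.access.key", "ACCESSKEY")  // placeholder
  .config("spark.hadoop.fs.s3a.secret.key", "SECRETKEY")  // placeholder
  .getOrCreate()

// Same read as in the question, but through the s3a filesystem.
val df = spark.read.parquet("s3a://bucket/????/analytics")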