
Loading prefixed parquet files from s3 - Suspicious paths

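I have a set of parquet files stored under prefixed paths (following the S3 performance recommendations) that I want to load in Spark (using Amazon EMR 5.11.1), but:

  1. Listing the set of files that match the glob takes much longer than it does for non-prefixed files. Can this be improved?
  2. How can I avoid the following error?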

 val df = spark.read.parquet("s3://bucket/????/analytics")

java.lang.AssertionError: assertion failed: Conflicting directory
     structures detected. Suspicious paths:
        s3://bucket/4a73/analytics
        s3://bucket/8163/analytics

If provided paths are partition directories, please set "basePath" in the options of the data source to specify the root directory of the table. If there are multiple root directories, please load them separately and then union them.
  at scala.Predef$.assert(Predef.scala:170)
  at org.apache.spark.sql.execution.datasources.PartitioningUtils$.parsePartitions(PartitioningUtils.scala:132)
  at org.apache.spark.sql.execution.datasources.PartitioningUtils$.parsePartitions(PartitioningUtils.scala:97)
  at org.apache.spark.sql.execution.datasources.PartitioningAwareFileIndex.inferPartitioning(PartitioningAwareFileIndex.scala:153)
  at org.apache.spark.sql.execution.datasources.InMemoryFileIndex.partitionSpec(InMemoryFileIndex.scala:70)
  at org.apache.spark.sql.execution.datasources.PartitioningAwareFileIndex.partitionSchema(PartitioningAwareFileIndex.scala:50)
  at org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:134)
  at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:353)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178)
  at org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:559)
  at org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:543)
  ... 48 elided
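
The error text itself suggests loading each root directory separately and then unioning the results. A minimal sketch of that approach follows, assuming every root has the same schema. The object name LoadPrefixedRoots and the function loadRootsAndUnion are hypothetical, introduced only for illustration. It addresses the assertion error only; it is not meant as a fix for the listing time.

```scala
import org.apache.hadoop.fs.Path
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical helper, not part of Spark.
object LoadPrefixedRoots {
  /** Expand a glob into the directories it matches, read each one as its own
    * Parquet table, and union the results. */
  def loadRootsAndUnion(spark: SparkSession, glob: String): DataFrame = {
    val globPath = new Path(glob)
    val fs = globPath.getFileSystem(spark.sparkContext.hadoopConfiguration)

    // globStatus can return null, so wrap the result in Option.
    val roots = Option(fs.globStatus(globPath))
      .map(_.toSeq)
      .getOrElse(Seq.empty)
      .filter(_.isDirectory)
      .map(_.getPath.toString)
    require(roots.nonEmpty, s"No directories matched $glob")

    // Each read infers partitions relative to its own root, so the roots are
    // never compared with each other. union matches columns by position,
    // which is why this assumes an identical schema in every root.
    roots.map(root => spark.read.parquet(root)).reduce(_ union _)
  }
}
```

With the path from the question, the call would look like LoadPrefixedRoots.loadRootsAndUnion(spark, "s3://bucket/????/analytics").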

You can use s3a instead of s3. That may work for you.

1. You also need the hadoop-aws 2.7.1 JAR on the classpath. This JAR contains

class org.apache.hadoop.fs.s3a.S3AFileSystem.

2. In spark.properties, you can set the following:

spark.hadoop.fs.s3a.access.key=ACCESSKEY  
spark.hadoop.fs.s3a.secret.key=SECRETKEY
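
The same two properties can also be set in code when the session is created. A minimal sketch, assuming the session is built by this code rather than provided by a shell that has already started one (in that case the settings may not reach the Hadoop configuration). The object name S3aSession, the app name, and the environment variable names in the usage comment are hypothetical choices, not something from the question.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical helper, not part of Spark.
object S3aSession {
  /** Build a session with the two s3a credential properties set in code.
    * Spark copies "spark.hadoop.*" entries into the Hadoop configuration
    * when the SparkContext is created. */
  def build(accessKey: String, secretKey: String): SparkSession =
    SparkSession.builder()
      .appName("s3a-parquet-example") // hypothetical name
      .config("spark.hadoop.fs.s3a.access.key", accessKey)
      .config("spark.hadoop.fs.s3a.secret.key", secretKey)
      .getOrCreate()
}

// Usage (assumes the keys are exported as environment variables):
//   val spark = S3aSession.build(sys.env("AWS_ACCESS_KEY_ID"),
//                                sys.env("AWS_SECRET_ACCESS_KEY"))
//   val df = spark.read.parquet("s3a://bucket/4a73/analytics")
```

Passing the keys in as arguments keeps them out of the source file, which is the main reason for not hard-coding ACCESSKEY and SECRETKEY here.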
