EMR 5.28 not able to load parquet files on s3

On EMR cluster 5.28.0, reading parquet files from s3 fails with the exception below, whereas on EMR 5.18.0 the same read works fine. Below is the stack trace on EMR 5.28.0.

I even tried from spark-shell:

val df = sqlContext.read.load("s3://s3_file_path/*")
df.take(5)

But it fails with the same exception:

Job aborted due to stage failure: Task 3 in stage 1.0 failed 4 times, most recent failure: Lost task 3.3 in stage 1.0 (TID 17, ip-x.x.x.x.ec2.internal, executor 1): org.apache.spark.sql.execution.datasources.FileDownloadException: Failed to download file path: s3://somedir/somesubdir/434560/1658_1564419581.parquet, range: 0-7928, partition values: [empty row], isDataPresent: false
    at org.apache.spark.sql.execution.datasources.AsyncFileDownloader.next(AsyncFileDownloader.scala:142)
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.getNextFile(FileScanRDD.scala:241)
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:171)
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:130)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.scan_nextBatch_0$(Unknown Source)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
    at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
    at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.agg_doAggregateWithKeys_0$(Unknown Source)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
    at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
    at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
    at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
    at org.apache.spark.scheduler.Task.run(Task.scala:123)
    at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.NullPointerException
    at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat.org$apache$spark$sql$execution$datasources$parquet$ParquetFileFormat$$isCreatedByParquetMr(ParquetFileFormat.scala:352)
    at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildPrefetcherWithPartitionValues$1.apply(ParquetFileFormat.scala:676)
    at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildPrefetcherWithPartitionValues$1.apply(ParquetFileFormat.scala:579)
    at org.apache.spark.sql.execution.datasources.AsyncFileDownloader.org$apache$spark$sql$execution$datasources$AsyncFileDownloader$$downloadFile(AsyncFileDownloader.scala:93)
    at org.apache.spark.sql.execution.datasources.AsyncFileDownloader$$anonfun$initiateFilesDownload$2$$anon$1.call(AsyncFileDownloader.scala:73)
    at org.apache.spark.sql.execution.datasources.AsyncFileDownloader$$anonfun$initiateFilesDownload$2$$anon$1.call(AsyncFileDownloader.scala:72)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    ... 3 more

I am not able to find this documented anywhere. Has anyone faced this issue on EMR 5.28.0 and been able to fix it?

On 5.28 I am able to read files written to s3 by EMR, but reading existing parquet files written by parquet-go throws the above exception, whereas it works fine on EMR 5.18.

Update: On inspecting the parquet files, the older ones that work only with 5.18 have missing stats:

creator:            null 
file schema:        parquet-go-root 
timestringhr:        BINARY SNAPPY DO:0 FPO:21015 SZ:1949/25676/13.17 VC:1092 ENC:RLE,BIT_PACKED,PLAIN ST:[no stats for this column]
timeseconds:         INT64 SNAPPY DO:0 FPO:22964 SZ:1397/9064/6.49 VC:1092 ENC:RLE,BIT_PACKED,PLAIN ST:[min: 1564419460, max: 1564419581, num_nulls not defined]

whereas those which work on both EMR 5.18 and 5.28 look like this:

creator:            parquet-mr version 1.10.0 (build 031a6654009e3b82020012a18434c582bd74c73a) 
extra:              org.apache.spark.sql.parquet.row.metadata = {<schema_here>}    
timestringhr:        BINARY SNAPPY DO:0 FPO:3988 SZ:156/152/0.97 VC:1092 ENC:PLAIN_DICTIONARY,RLE,BIT_PACKED ST:[min: 2019-07-29 16:00:00, max: 2019-07-29 16:00:00, num_nulls: 0]
timeseconds:         INT64 SNAPPY DO:0 FPO:4144 SZ:954/1424/1.49 VC:1092 ENC:PLAIN_DICTIONARY,RLE,BIT_PACKED ST:[min: 1564419460, max: 1564419581, num_nulls: 0]

This might be causing the NullPointerException. I found a related parquet-mr issue: https://issues.apache.org/jira/browse/PARQUET-1217 . I can try including an updated version of parquet in the classpath, or testing on the EMR 6 beta, to see if that fixes the issue.
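
To confirm which files have a null creator without downloading them, a quick check from spark-shell (just a sketch, assuming the parquet-mr classes that ship with EMR are on the classpath; the path is the one from the failing task above):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.parquet.hadoop.ParquetFileReader
import org.apache.parquet.hadoop.util.HadoopInputFile

// File taken from the failing task in the stack trace above
val file = new Path("s3://somedir/somesubdir/434560/1658_1564419581.parquet")
val reader = ParquetFileReader.open(HadoopInputFile.fromPath(file, new Configuration()))
// Prints null for the parquet-go files, which is what the
// isCreatedByParquetMr check in ParquetFileFormat trips over
println(reader.getFooter.getFileMetaData.getCreatedBy)
reader.close()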

Try to add a created_by value to the footer. I traced one NPE down to the footer/created_by check in Spark. If you are using xitongsys/parquet-go, kindly consider this:

var writer_version = "parquet-go version 1.0"
...

...
pw, err := writer.NewJSONWriter(schemaStr, fw, 4)
pw.Footer.CreatedBy = &writer_version
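
(In xitongsys/parquet-go the footer's CreatedBy field is an optional *string, which is why the snippet assigns the address of writer_version rather than the value.)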

Ensure that your parquet files do not contain row groups that have zero rows. You might have to debug through the file using a reader while it loads. We encountered this in AWS Glue, getting "illegal row group of 0 rows".

Fix: We were using the Parquet.net NuGet package and restricted it from writing row groups if they did not contain data.
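
Our fix was on the .NET write side, but if you want to inspect an existing file for empty row groups from the Spark side first, a minimal sketch using the parquet-mr footer API (the path here is a placeholder) would be:

import scala.collection.JavaConverters._
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.parquet.hadoop.ParquetFileReader
import org.apache.parquet.hadoop.util.HadoopInputFile

// Placeholder path; point this at the file you suspect
val file = new Path("s3://your-bucket/path/to/file.parquet")
val reader = ParquetFileReader.open(HadoopInputFile.fromPath(file, new Configuration()))
val blocks = reader.getFooter.getBlocks.asScala
// Row groups with zero rows are the ones that trigger "illegal row group of 0 rows"
val empty = blocks.filter(_.getRowCount == 0)
println(s"row groups: ${blocks.size}, empty: ${empty.size}")
reader.close()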

It most likely is caused by a lack of IAM permissions for the EMR assumed role to access the S3 location where the files are located.
