
How to save parquet in S3 from AWS SageMaker?

I would like to save a Spark DataFrame from AWS SageMaker to S3. In the notebook, I ran

myDF.write.mode('overwrite').parquet("s3a://my-bucket/dir/dir2/")

I get

Py4JJavaError: An error occurred while calling o326.parquet.
: java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3native.NativeS3FileSystem not found
    at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2195)
    at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2654)
    at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2667)
    at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
    at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
    at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
    at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
    at org.apache.spark.sql.execution.datasources.DataSource.writeInFileFormat(DataSource.scala:394)
    at org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:471)
    at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:50)
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:135)
    at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:116)
    at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:92)
    at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:92)
    at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:609)
    at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:233)
    at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:217)
    at org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:508)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:280)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:214)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3native.NativeS3FileSystem not found
    at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2101)
    at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2193)

How should I do this correctly in the notebook? Many thanks!

The SageMaker notebook instance is not running Spark code, and it doesn't have the Hadoop or other Java classes that you are trying to invoke.

The Jupyter notebook in SageMaker usually comes with Python libraries such as Pandas, and you can use them to write the parquet file (for example, https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_parquet.html ).
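A minimal sketch of the pandas route, assuming the pyarrow and s3fs packages are installed in the notebook kernel and the notebook's IAM role can write to the bucket (the bucket name below is just the one from the question; the sample data is made up):

import pandas as pd

# Build a small pandas DataFrame (or convert a small Spark DataFrame
# with myDF.toPandas() if one exists). Writing directly to an "s3://"
# path relies on the s3fs package being installed, which is an
# assumption about this environment.
df = pd.DataFrame({"id": [1, 2, 3], "value": ["a", "b", "c"]})
df.to_parquet("s3://my-bucket/dir/dir2/data.parquet")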

Another option is to connect from the Jupyter notebook to an existing (or new) Spark cluster and execute the command remotely there. See here for documentation on how to set this connection up: https://aws.amazon.com/blogs/machine-learning/build-amazon-sagemaker-notebooks-backed-by-spark-in-amazon-emr/
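For illustration, assuming the notebook has been connected to an EMR cluster as described in that blog post (e.g. through the sparkmagic/Livy kernel it sets up), the original write then executes on the cluster, where the EMR file system resolves S3 paths, so the plain "s3://" scheme should work:

# Runs on the EMR cluster rather than the notebook instance;
# myDF here is the DataFrame from the question.
myDF.write.mode('overwrite').parquet("s3://my-bucket/dir/dir2/")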
