Reading Parquet files in AWS Glue

I'm an AWS Glue newbie trying to read some Parquet objects that I have in S3, but the read fails with a ClassNotFoundException. This is my attempt so far, based on Glue's minimal documentation:

import com.amazonaws.services.glue.GlueContext
import com.amazonaws.services.glue.util.JsonOptions
import org.apache.spark.sql.SparkSession

val gc: GlueContext = new GlueContext(sc)

val spark_session : SparkSession = gc.getSparkSession

val source = gc.getSource("s3", JsonOptions(Map("paths" -> Set("s3://path-to-parquet"))))

val parquetSource = source.withFormat("parquet")

parquetSource.getDynamicFrame().show(1)

And the exception:

   18/06/11 13:39:11 WARN TaskSetManager: Lost task 0.0 in stage 6.0 (TID 266, ip-172-31-8-179.eu-west-1.compute.internal, executor 16): java.lang.ClassNotFoundException: Failed to load format with name parquet
    at com.amazonaws.services.glue.util.ClassUtils$.loadByFullName(ClassUtils.scala:28)
    at com.amazonaws.services.glue.util.ClassUtils$.getClassByName(ClassUtils.scala:43)
    at com.amazonaws.services.glue.util.ClassUtils$.newInstanceByName(ClassUtils.scala:54)
    at com.amazonaws.services.glue.readers.DynamicRecordStreamReader$.apply(DynamicRecordReader.scala:187)
    ...
Caused by: java.lang.ClassNotFoundException: parquet
    at org.apache.spark.repl.ExecutorClassLoader.findClass(ExecutorClassLoader.scala:82)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
    at com.amazonaws.services.glue.util.ClassUtils$$anonfun$1.apply(ClassUtils.scala:25)
    at com.amazonaws.services.glue.util.ClassUtils$$anonfun$1.apply(ClassUtils.scala:25)
    at scala.util.Try$.apply(Try.scala:192)
    at com.amazonaws.services.glue.util.ClassUtils$.loadByFullName(ClassUtils.scala:25)
    ... 28 more
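
The frames ClassUtils.loadByFullName and ClassLoader.loadClass, together with "Caused by: java.lang.ClassNotFoundException: parquet", suggest that the format string was handed to a class loader as if it were a fully qualified class name. That is an inference from the trace, not documented Glue behavior. The sketch below reproduces the same kind of failure on a plain JVM; loadByFullName here is a hypothetical stand-in written for illustration, not Glue's actual implementation.

```scala
import scala.util.{Failure, Success, Try}

// Hypothetical stand-in: resolve a name by treating it as a fully qualified
// class name. This mirrors what the stack trace appears to show, but it is
// an assumption, not Glue's real code.
def loadByFullName(name: String): Try[Class[_]] =
  Try(Class.forName(name))

// Report whether a given name can be resolved as a class.
def describe(name: String): String =
  loadByFullName(name) match {
    case Success(cls)                       => s"loaded ${cls.getName}"
    case Failure(_: ClassNotFoundException) => s"no class named '$name'"
    case Failure(other)                     => s"unexpected failure: $other"
  }

println(describe("parquet"))          // a bare format name is not a class name
println(describe("java.lang.String")) // a real fully qualified name resolves
```

A bare word such as "parquet" cannot be resolved this way unless something first maps it to a real class.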

I solved the issue. I had specified the wrong connectionType for 'getSource': it should be "parquet", not "s3":

import com.amazonaws.services.glue.GlueContext
import com.amazonaws.services.glue.util.JsonOptions
import org.apache.spark.sql.SparkSession

val gc: GlueContext = new GlueContext(sc)

val spark_session : SparkSession = gc.getSparkSession

val source = gc.getSource("parquet", JsonOptions(Map("paths" -> Set("s3://path-to-parquet"))))

source.getDynamicFrame().show(1)

https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-connect.html#aws-glue-programming-etl-connect-parquet
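
For comparison, here is a minimal sketch of an alternative that avoids Glue's format lookup: read the files with Spark's own Parquet reader, then wrap the result in a DynamicFrame. This is not part of the fix above. readParquetViaSpark is a hypothetical helper name, and the sketch assumes the Glue Scala libraries are on the classpath and that the job's role can read the S3 path.

```scala
import com.amazonaws.services.glue.{DynamicFrame, GlueContext}
import org.apache.spark.sql.DataFrame

// Hypothetical helper (not part of the Glue API): read Parquet through the
// SparkSession that GlueContext exposes, then convert to a DynamicFrame.
def readParquetViaSpark(gc: GlueContext, path: String): DynamicFrame = {
  // Spark's built-in reader handles Parquet without any Glue connection type.
  val df: DataFrame = gc.getSparkSession.read.parquet(path)
  // DynamicFrame.apply wraps an existing DataFrame for use with Glue transforms.
  DynamicFrame(df, gc)
}

// Usage, with the same placeholder path as above:
// readParquetViaSpark(gc, "s3://path-to-parquet").show(1)
```

If only DataFrame operations are needed, the conversion step can be dropped and the DataFrame used directly.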

Hopefully this helps somebody!
