
Hadoop 3 gcs-connector doesn't work properly with the latest version of Spark 3 in standalone mode

I wrote a simple Scala application which reads a parquet file from GCS bucket. The application uses:

  • JDK 17
  • Scala 2.12.17
  • Spark SQL 3.3.1
  • gcs-connector hadoop3-2.2.7

The connector is pulled from Maven and imported via sbt (the Scala build tool). I'm not using the latest version, 2.2.9, because of this issue.
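
For reference, the sbt dependency declaration would look roughly like this (a sketch of my own; the actual build.sbt isn't shown here):

// build.sbt (sketch, reconstructed from the versions listed above)
scalaVersion := "2.12.17"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-sql" % "3.3.1",
  "com.google.cloud.bigdataoss" % "gcs-connector" % "hadoop3-2.2.7"
)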

The application works perfectly in local mode, so I tried switching to standalone mode.

Here are the steps I took:

  1. Downloaded Spark 3.3.1 from here
  2. Started the cluster manually, as described here

I tried to run the application again and faced this error:

[error] Caused by: java.lang.RuntimeException: java.lang.ClassNotFoundException: Class com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem not found
[error]         at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2688)
[error]         at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:3431)
[error]         at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3466)
[error]         at org.apache.hadoop.fs.FileSystem.access$300(FileSystem.java:174)
[error]         at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3574)
[error]         at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3521)
[error]         at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:540)
[error]         at org.apache.hadoop.fs.Path.getFileSystem(Path.java:365)
[error]         at org.apache.parquet.hadoop.util.HadoopInputFile.fromStatus(HadoopInputFile.java:44)
[error]         at org.apache.spark.sql.execution.datasources.parquet.ParquetFooterReader.readFooter(ParquetFooterReader.java:44)
[error]         at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$.$anonfun$readParquetFootersInParallel$1(ParquetFileFormat.scala:484)
[error]         ... 14 more
[error] Caused by: java.lang.ClassNotFoundException: Class com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem not found
[error]         at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2592)
[error]         at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2686)
[error]         ... 24 more

Somehow it cannot find the connector's file system class: java.lang.ClassNotFoundException: Class com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem not found
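
One way to narrow this down (a diagnostic sketch of my own, not part of the original post) is to check whether the class resolves on the driver versus inside an executor, for example from a spark-shell connected to the standalone master:

// Checks the driver's classpath:
Class.forName("com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem")

// Runs the same check inside an executor JVM on a worker:
spark.range(1).foreach { _ =>
  Class.forName("com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem")
}

If the first call succeeds but the second throws, the connector jar is on the driver's classpath (via sbt) but is not available to the workers. Note that the stack trace above originates in readParquetFootersInParallel, i.e. in executor-side code.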

My Spark configuration is pretty basic:

spark.app.name = "Example app"
spark.master = "spark://YOUR_SPARK_MASTER_HOST:7077"
spark.hadoop.fs.defaultFS = "gs://YOUR_GCP_BUCKET"
spark.hadoop.fs.gs.impl = "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem"
spark.hadoop.fs.AbstractFileSystem.gs.impl = "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS"
spark.hadoop.google.cloud.auth.service.account.enable = true
spark.hadoop.google.cloud.auth.service.account.json.keyfile = "src/main/resources/gcp_key.json"
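
For completeness, a minimal version of the reading code could look like this (a sketch with placeholder names; the parquet path and application structure are my assumptions, not from the question):

// Minimal sketch of the application (my own reconstruction)
import org.apache.spark.sql.SparkSession

object ExampleApp {
  def main(args: Array[String]): Unit = {
    // Assumes the configuration above is supplied via spark-defaults or --conf
    val spark = SparkSession.builder().getOrCreate()

    // Read a parquet file from the GCS bucket; the path is a placeholder
    val df = spark.read.parquet("gs://YOUR_GCP_BUCKET/path/to/file.parquet")
    df.show()

    spark.stop()
  }
}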

I've found out that the Maven version of the GCS Hadoop connector is missing dependencies internally.

I fixed it by resolving the connector's transitive dependencies myself and adding them to Spark's classpath: I unpacked the gcs-connector jar file, located the pom.xml inside it, copied its dependencies into a new standalone pom.xml, and downloaded them using the mvn dependency:copy-dependencies -DoutputDirectory=/path/to/pyspark/jars/ command.

Here is an example pom.xml that I've created; note that I am using the 2.2.9 version of the connector:

<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <name>TMP_PACKAGE_NAME</name>
    <description>
        jar dependencies of gcs hadoop connector
    </description>
    <groupId>TMP_PACKAGE_GROUP</groupId>
    <artifactId>TMP_PACKAGE_NAME</artifactId>
    <version>0.0.1</version>
    <dependencies>

        <dependency>
            <groupId>com.google.cloud.bigdataoss</groupId>
            <artifactId>gcs-connector</artifactId>
            <version>hadoop3-2.2.9</version>
        </dependency>

        <dependency>
            <groupId>com.google.api-client</groupId>
            <artifactId>google-api-client-jackson2</artifactId>
            <version>2.1.0</version>
        </dependency>

        <dependency>
            <groupId>com.google.guava</groupId>
            <artifactId>guava</artifactId>
            <version>31.1-jre</version>
        </dependency>
        <dependency>
            <groupId>com.google.oauth-client</groupId>
            <artifactId>google-oauth-client</artifactId>
            <version>1.34.1</version>
        </dependency>

        <dependency>
            <groupId>com.google.cloud.bigdataoss</groupId>
            <artifactId>util</artifactId>
            <version>2.2.9</version>
        </dependency>
        <dependency>
            <groupId>com.google.cloud.bigdataoss</groupId>
            <artifactId>util-hadoop</artifactId>
            <version>hadoop3-2.2.9</version>
        </dependency>
        <dependency>
            <groupId>com.google.cloud.bigdataoss</groupId>
            <artifactId>gcsio</artifactId>
            <version>2.2.9</version>
        </dependency>
        <dependency>
            <groupId>com.google.auto.value</groupId>
            <artifactId>auto-value-annotations</artifactId>
            <version>1.10.1</version>
            <scope>runtime</scope>
        </dependency>

        <dependency>
            <groupId>com.google.flogger</groupId>
            <artifactId>flogger</artifactId>
            <version>0.7.4</version>
        </dependency>

        <dependency>
            <groupId>com.google.flogger</groupId>
            <artifactId>google-extensions</artifactId>
            <version>0.7.4</version>
        </dependency>

        <dependency>
            <groupId>com.google.flogger</groupId>
            <artifactId>flogger-system-backend</artifactId>
            <version>0.7.4</version>
        </dependency>

        <dependency>
            <groupId>com.google.code.gson</groupId>
            <artifactId>gson</artifactId>
            <version>2.10</version>
        </dependency>

    </dependencies>
</project>

Hope this helps!

This is caused by the fact that Spark uses an old Guava library version and you used a non-shaded GCS connector jar. To make it work, you just need to use the shaded GCS connector jar from Maven, for example: https://repo1.maven.org/maven2/com/google/cloud/bigdataoss/gcs-connector/hadoop3-2.2.9/gcs-connector-hadoop3-2.2.9-shaded.jar
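
If you are resolving the connector through sbt, as in the question, the shaded artifact can be selected via its Maven classifier (a sketch, assuming the rest of the build stays unchanged):

// build.sbt (sketch): depend on the shaded connector jar via its classifier
libraryDependencies +=
  "com.google.cloud.bigdataoss" % "gcs-connector" % "hadoop3-2.2.9" classifier "shaded"

Alternatively, the shaded jar can simply be dropped into Spark's jars/ directory on every node of the standalone cluster.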
