
Cloud Dataproc can't access Cloud Storage bucket

I have a Cloud Dataproc Spark job that also uses the Cloud Storage API from the driver side (to choose specific files from the same folder to work with).

Here are the Maven dependencies:

<dependencies>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-core_2.12</artifactId>
        <version>2.4.4</version>
        <scope>provided</scope>
    </dependency>
    <dependency>
        <groupId>com.google.cloud</groupId>
        <artifactId>google-cloud-storage</artifactId>
        <version>1.101.0</version>
    </dependency>
</dependencies>

Here is the simplest version of the code that fails:

import com.google.cloud.storage._

object Test {
  def main(args: Array[String]): Unit = {
    val storage = StorageOptions.getDefaultInstance().getService()
    storage.list("intent_raw") // <-- this call fails
  }
}

Here is the stack trace:

Exception in thread "main" java.lang.NoSuchMethodError: com.google.common.util.concurrent.MoreExecutors.directExecutor()Ljava/util/concurrent/Executor;
    at com.google.api.gax.retrying.BasicRetryingFuture.<init>(BasicRetryingFuture.java:84)
    at com.google.api.gax.retrying.DirectRetryingExecutor.createFuture(DirectRetryingExecutor.java:88)
    at com.google.api.gax.retrying.DirectRetryingExecutor.createFuture(DirectRetryingExecutor.java:74)
    at com.google.cloud.RetryHelper.run(RetryHelper.java:75)
    at com.google.cloud.RetryHelper.runWithRetries(RetryHelper.java:50)
    at com.google.cloud.storage.StorageImpl.listBlobs(StorageImpl.java:372)
    at com.google.cloud.storage.StorageImpl.list(StorageImpl.java:328)
--> at ai.mandal.cloud.dataproc.Test$.main(Test.scala:14)
    at ai.mandal.cloud.dataproc.Test.main(Test.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
    at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:845)
    at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:161)
    at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:184)
    at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
    at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:920)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:929)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

My question is generally what can cause this, and also: if I am running it from a Dataproc service (which already has access to the bucket), do I need to configure separate credentials for that?

The solution was to add

spark.executor.userClassPathFirst = true
spark.driver.userClassPathFirst = true

to job properties.
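
For a Dataproc job, these properties can be passed at submission time. A minimal sketch with gcloud, assuming a hypothetical cluster name and jar location:

gcloud dataproc jobs submit spark \
    --cluster=my-cluster \
    --class=ai.mandal.cloud.dataproc.Test \
    --jars=gs://my-bucket/test.jar \
    --properties=spark.driver.userClassPathFirst=true,spark.executor.userClassPathFirst=true

With userClassPathFirst enabled, the driver and executors resolve classes from the job's jars before the cluster's classpath, so the newer Guava bundled with google-cloud-storage wins.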

The problem is caused by conflicting versions of Guava between google-cloud-storage and the host environment: the Spark/Hadoop classpath on the cluster ships an older Guava that predates MoreExecutors.directExecutor(), and by default it shadows the newer Guava that google-cloud-storage requires, hence the NoSuchMethodError.
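
One way to confirm which Guava actually wins (a hypothetical diagnostic of my own, not part of the original job) is to print the jar the class loader resolved MoreExecutors from:

import com.google.common.util.concurrent.MoreExecutors

object GuavaCheck {
  def main(args: Array[String]): Unit = {
    // Ask the class loader where Guava's MoreExecutors class actually came from;
    // getCodeSource can be null for bootstrap classes, hence the Option.
    val location = Option(classOf[MoreExecutors].getProtectionDomain.getCodeSource)
      .map(_.getLocation.toString)
      .getOrElse("bootstrap classpath")
    println(s"Guava loaded from: $location")
  }
}

Run with and without the userClassPathFirst properties to see the resolved jar switch from the cluster's Guava to the one in your job's jars.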

Google recommends shading the conflicting Guava inside your own artifact; I tried that as well, but it didn't work in this case.
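
For reference, the shading approach would look roughly like this in the pom.xml, using the maven-shade-plugin to relocate Guava's packages (the repackaged prefix is an arbitrary choice):

<build>
    <plugins>
        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-shade-plugin</artifactId>
            <version>3.2.1</version>
            <executions>
                <execution>
                    <phase>package</phase>
                    <goals>
                        <goal>shade</goal>
                    </goals>
                    <configuration>
                        <relocations>
                            <!-- Rewrite Guava's packages in the fat jar so they
                                 cannot collide with the cluster's Guava -->
                            <relocation>
                                <pattern>com.google.common</pattern>
                                <shadedPattern>repackaged.com.google.common</shadedPattern>
                            </relocation>
                        </relocations>
                    </configuration>
                </execution>
            </executions>
        </plugin>
    </plugins>
</build>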
