How to run a Scio pipeline on Dataflow from SBT (local)

I am trying to run my first Scio pipeline on Dataflow.

The code in question can be found here. However, I do not think that is too important.
My first experiment was to read some local CSV files and write another local CSV file, using the DirectRunner. That worked as expected.

Now, I am trying to read the files from GCS, write the output to BigQuery and run the pipeline using the DataflowRunner. I already made all the necessary changes (or that is what I believe), but I am unable to make it run.

I already ran gcloud auth application-default login, and when I run

sbt run --runner=DataflowRunner --project=project-id --input-path=gs://path/to/data --output-table=dataset.table

I can see the job is submitted in Dataflow. However, after one hour the job fails with the following error message:

Workflow failed. Causes: The Dataflow job appears to be stuck because no worker activity has been seen in the last 1h.

(Note: the job did nothing in all that time, and since this is an experiment, the data is simply too small to take more than a couple of minutes.)

Checking Stackdriver, I can find the following error:

java.lang.ClassNotFoundException: scala.collection.Seq

It seems related to Jackson:

java.util.ServiceConfigurationError: com.fasterxml.jackson.databind.Module: Provider com.fasterxml.jackson.module.scala.DefaultScalaModule could not be instantiated

And that is what is killing each executor right at the start. I really do not understand why it cannot find the Scala standard library.

I also tried to first create a template and run it later with:

sbt run --runner=DataflowRunner --project=project-id --input-path=gs://path/to/data --output-table=dataset.table --stagingLocation=gs://path/to/staging --templateLocation=gs://path/to/templates/template-1

But, after running the template, I get the same error.
Also, I noticed that in the staging folder there are a lot of jars, but the scala-library.jar is not in there.

Am I missing something obvious?

It's a known issue with sbt 1.3.0, which introduced a breaking change wrt class loaders. Try 1.2.8?

Also the Jackson issue is probably related to Java 11 or above. Stay with Java 8 for now.

Fix by setting the sbt classLoaderLayeringStrategy:

run / classLoaderLayeringStrategy := ClassLoaderLayeringStrategy.Flat
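For context, here is a rough sketch of where that line might sit in a build.sbt. The project name and Scala version are hypothetical placeholders, not taken from the question; only the classLoaderLayeringStrategy line comes from this answer.

```scala
// build.sbt (sketch; name and scalaVersion are assumptions for illustration)
lazy val root = (project in file("."))
  .settings(
    name := "scio-dataflow-example", // hypothetical project name
    scalaVersion := "2.12.10",       // assumed; use whatever your project already uses
    // Load the application and all of its dependencies (including the
    // Scala library) in a single classloader when using `sbt run`.
    run / classLoaderLayeringStrategy := ClassLoaderLayeringStrategy.Flat
  )
```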

sbt uses a new classloader for the application that is run with run. This causes other classes already loaded by the JVM (Predef for instance) to be reused, reducing startup time. See in-process classloaders for details.

This doesn't play well with the Beam DataflowRunner because it explicitly does not stage classes from parent classloaders, see PipelineResources.java#L51:

Attempts to detect all the resources the class loader has access to. This does not recurse to class loader parents stopping it from pulling in resources from the system class loader.

So the fix is to force all classes used by your application to be loaded in the same classloader so that DataflowRunner stages everything.
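If you want to check this on your own setup, a small diagnostic along these lines may help. It is only a sketch: the object name ClassLoaderCheck and the helper chain are made up for illustration and are not part of Scio or Beam. It prints which classloader loaded the application and which one loaded scala.collection.Seq, and then walks the parent chain.

```scala
// Hypothetical diagnostic, not part of Scio or Beam.
object ClassLoaderCheck {

  // Walk from the given classloader up through its parents.
  // The walk stops at the first null parent (the bootstrap loader).
  def chain(cl: ClassLoader): List[ClassLoader] =
    Iterator.iterate(cl)(_.getParent).takeWhile(_ != null).toList

  def main(args: Array[String]): Unit = {
    val appLoader   = getClass.getClassLoader
    val scalaLoader = classOf[scala.collection.Seq[_]].getClassLoader

    println(s"application loaded by:        $appLoader")
    println(s"scala.collection.Seq loaded by: $scalaLoader")
    println(s"same classloader:             ${appLoader eq scalaLoader}")

    println("parent chain of the application classloader:")
    chain(appLoader).foreach(l => println(s"  $l"))
  }
}
```

Run it with sbt run (for example via runMain ClassLoaderCheck). If the two loaders differ, the Scala library lives in a parent classloader, which would be consistent with scala-library.jar missing from the staging folder. With the Flat strategy I would expect both to be the same loader.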

Hope that helps
