
Apache Beam upgrade issue

I recently upgraded my project's Apache Beam version (<beam.version>) from 2.19 to 2.34.

The current Maven configuration is as follows:

...
<build>
  <plugins>
    <plugin>
      <groupId>org.apache.maven.plugins</groupId>
      <artifactId>maven-shade-plugin</artifactId>
      <version>3.2.0</version>
      <configuration>
        <source>1.8</source>
        <target>1.8</target>
...
    <plugin>
      <artifactId>maven-compiler-plugin</artifactId>
      <version>3.8.0</version>
    </plugin>
...
<properties>
  <beam.version>2.34.0</beam.version>
  <hbase.version>1.2.0-cdh5.8.2</hbase.version>
  <guava.version>31.0.1-jre</guava.version>
</properties>
<dependencyManagement>
  <dependencies>
    <dependency>
      <groupId>com.google.cloud</groupId>
      <artifactId>libraries-bom</artifactId>
      <version>24.0.0</version>
      <type>pom</type>
      <scope>import</scope>
    </dependency>
  </dependencies>
</dependencyManagement>
...

The job is launched with the following arguments (the custom options behind these flags are sketched right after the list):

--project=*masked* \
--environment=*masked* \
--tableName=*masked* \
--outputFolder=gs://*masked* \
--stagingLocation=gs://*masked* \
--tempLocation=gs://*masked* \
--serviceAccount=dataflow-instance@*masked* \
--region=asia-southeast1 \
--flexRSGoal=COST_OPTIMIZED \
--maxNumWorkers=50 \
--workerMachineType=n1-standard-2 \
--startDate=${start_date}" 00:00" \
--endDate=${end_date}" 00:00" \
--jobName=*masked*-${start_date_no_dash}
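
For context, the custom arguments above (tableName, environment, outputFolder, startDate, endDate) are declared through a Beam options interface roughly like the simplified sketch below. The interface name and the descriptions are placeholders, and the standard flags such as --project, --region and --maxNumWorkers come from DataflowPipelineOptions:

import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
import org.apache.beam.sdk.options.Description;
import org.apache.beam.sdk.options.Validation;

// Simplified sketch of the custom pipeline options; names mirror the arguments above.
public interface ExportOptions extends DataflowPipelineOptions {

  @Description("HBase table to export")
  @Validation.Required
  String getTableName();
  void setTableName(String value);

  @Description("Target environment")
  String getEnvironment();
  void setEnvironment(String value);

  @Description("GCS folder for the Parquet output")
  String getOutputFolder();
  void setOutputFolder(String value);

  @Description("Start of the export window, e.g. \"2021-12-01 00:00\"")
  String getStartDate();
  void setStartDate(String value);

  @Description("End of the export window, e.g. \"2021-12-14 00:00\"")
  String getEndDate();
  void setEndDate(String value);
}

In main(), the arguments are turned into this interface with something like PipelineOptionsFactory.fromArgs(args).withValidation().as(ExportOptions.class).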

However, when I trigger the job it produces a lot of warnings and ends with errors.

2021-12-14T17:29:01.589Z Can't verify serialized elements of type SerializableConfiguration have well defined equals method. This may produce incorrect results on some PipelineRunner

2021-12-14T17:29:02.192Z Unable to load native-hadoop library for your platform... using builtin-java classes where applicable

2021-12-14T17:31:01.970Z Splitting source org.apache.beam.sdk.io.hbase.HBaseIO$HBaseSource@1469295f into bundles of estimated size 939979010211 bytes produced 12180 bundles, which have total serialized size 500898593 bytes. As this is too large for the Google Cloud Dataflow API, retrying splitting once with increased desiredBundleSizeBytes 44902244917318 to reduce the number of splits.

2021-12-14T17:32:54.037Z Splitting source org.apache.beam.sdk.io.hbase.HBaseIO$HBaseSource@1469295f into bundles of estimated size 44902244917318 bytes produced 12181 bundles. Rebundling into 100 bundles.

2021-12-14T17:34:01.513Z Operation ongoing in step Read from HBase/Read(HBaseSource) for at least 05m00s without outputting or completing in state split
        at org.xerial.snappy.SnappyNative.rawCompress(Native Method)
        at org.xerial.snappy.Snappy.rawCompress(Snappy.java:450)
        at org.xerial.snappy.Snappy.compress(Snappy.java:123)
        at org.xerial.snappy.SnappyOutputStream.compressInput(SnappyOutputStream.java:380)
        at org.xerial.snappy.SnappyOutputStream.write(SnappyOutputStream.java:130)
        at java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1877)
        at java.io.ObjectOutputStream$BlockDataOutputStream.write(ObjectOutputStream.java:1848)
        at java.io.ObjectOutputStream.write(ObjectOutputStream.java:709)
        at java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1877)
        at java.io.ObjectOutputStream$BlockDataOutputStream.write(ObjectOutputStream.java:1848)
        at java.io.ObjectOutputStream.write(ObjectOutputStream.java:709)
        at org.apache.hadoop.io.WritableUtils.writeCompressedByteArray(WritableUtils.java:75)
        at org.apache.hadoop.io.WritableUtils.writeCompressedString(WritableUtils.java:94)
        at org.apache.hadoop.io.WritableUtils.writeCompressedStringArray(WritableUtils.java:155)
        at org.apache.hadoop.conf.Configuration.write(Configuration.java:2974)
        at org.apache.beam.sdk.io.hadoop.SerializableConfiguration.writeExternal(SerializableConfiguration.java:58)
        at java.io.ObjectOutputStream.writeExternalData(ObjectOutputStream.java:1459)
        at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1430)
        at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
        at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
        at org.apache.beam.sdk.coders.SerializableCoder.encode(SerializableCoder.java:189)
        at org.apache.beam.sdk.io.hbase.HBaseIO$Read$SerializationProxy.writeObject(HBaseIO.java:317)
        at sun.reflect.GeneratedMethodAccessor2.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:1128)
        at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1496)
        at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
        at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
        at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
        at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
        at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
        at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
        at java.io.ObjectOutputStream.writeArray(ObjectOutputStream.java:1378)
        at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1174)
        at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
        at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
        at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
        at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
        at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
        at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
        at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
        at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
        at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
        at org.apache.beam.sdk.util.SerializableUtils.serializeToByteArray(SerializableUtils.java:55)
        at org.apache.beam.runners.dataflow.worker.WorkerCustomSources.serializeSplitToCloudSource(WorkerCustomSources.java:151)
        at org.apache.beam.runners.dataflow.worker.WorkerCustomSources.wrapIntoSourceSplitResponse(WorkerCustomSources.java:320)
        at org.apache.beam.runners.dataflow.worker.WorkerCustomSources.performSplitTyped(WorkerCustomSources.java:265)
        at org.apache.beam.runners.dataflow.worker.WorkerCustomSources.performSplitWithApiLimit(WorkerCustomSources.java:201)
        at org.apache.beam.runners.dataflow.worker.WorkerCustomSources.performSplit(WorkerCustomSources.java:180)
        at org.apache.beam.runners.dataflow.worker.WorkerCustomSourceOperationExecutor.execute(WorkerCustomSourceOperationExecutor.java:82)
        at org.apache.beam.runners.dataflow.worker.BatchDataflowWorker.executeWork(BatchDataflowWorker.java:420)
        at org.apache.beam.runners.dataflow.worker.BatchDataflowWorker.doWork(BatchDataflowWorker.java:389)
        at org.apache.beam.runners.dataflow.worker.BatchDataflowWorker.getAndPerformWork(BatchDataflowWorker.java:314)
        at org.apache.beam.runners.dataflow.worker.DataflowBatchWorkerHarness$WorkerThread.doWork(DataflowBatchWorkerHarness.java:140)
        at org.apache.beam.runners.dataflow.worker.DataflowBatchWorkerHarness$WorkerThread.call(DataflowBatchWorkerHarness.java:120)
        at org.apache.beam.runners.dataflow.worker.DataflowBatchWorkerHarness$WorkerThread.call(DataflowBatchWorkerHarness.java:107)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)

2021-12-14T17:34:29.778517630Z Error message from worker: java.lang.IllegalArgumentException: Total size of the BoundedSource objects generated by split() operation is larger than the allowable limit. When splitting org.apache.beam.sdk.io.hbase.HBaseIO$HBaseSource@1469295f into bundles of 44902244917318 bytes it generated 12181 BoundedSource objects with total serialized size of 493647801 bytes which is larger than the limit 20971520. For more information, please check the corresponding FAQ entry at https://cloud.google.com/dataflow/pipelines/troubleshooting-your-pipeline
        org.apache.beam.runners.dataflow.worker.WorkerCustomSources.performSplitTyped(WorkerCustomSources.java:286)
        org.apache.beam.runners.dataflow.worker.WorkerCustomSources.performSplitWithApiLimit(WorkerCustomSources.java:201)
        org.apache.beam.runners.dataflow.worker.WorkerCustomSources.performSplit(WorkerCustomSources.java:180)
        org.apache.beam.runners.dataflow.worker.WorkerCustomSourceOperationExecutor.execute(WorkerCustomSourceOperationExecutor.java:82)
        org.apache.beam.runners.dataflow.worker.BatchDataflowWorker.executeWork(BatchDataflowWorker.java:420)
        org.apache.beam.runners.dataflow.worker.BatchDataflowWorker.doWork(BatchDataflowWorker.java:389)
        org.apache.beam.runners.dataflow.worker.BatchDataflowWorker.getAndPerformWork(BatchDataflowWorker.java:314)
        org.apache.beam.runners.dataflow.worker.DataflowBatchWorkerHarness$WorkerThread.doWork(DataflowBatchWorkerHarness.java:140)
        org.apache.beam.runners.dataflow.worker.DataflowBatchWorkerHarness$WorkerThread.call(DataflowBatchWorkerHarness.java:120)
        org.apache.beam.runners.dataflow.worker.DataflowBatchWorkerHarness$WorkerThread.call(DataflowBatchWorkerHarness.java:107)
        java.util.concurrent.FutureTask.run(FutureTask.java:266)
        java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        java.lang.Thread.run(Thread.java:748)

2021-12-14T17:50:04.082759654Z Workflow failed. Causes: S03:Read from HBase/Read(HBaseSource)+Map Result to Record+Save To Parquet/WriteFiles/WriteUnshardedBundlesToTempFiles/WriteUnshardedBundles+Save To Parquet/WriteFiles/GatherTempFileResults/Add void key/AddKeys/Map+Save To Parquet/WriteFiles/GatherTempFileResults/Reshuffle/Window.Into()/Window.Assign+Save To Parquet/WriteFiles/GatherTempFileResults/Reshuffle/GroupByKey/Reify+Save To Parquet/WriteFiles/GatherTempFileResults/Reshuffle/GroupByKey/Write+Save To Parquet/WriteFiles/WriteUnshardedBundlesToTempFiles/GroupUnwritten/Reify+Save To Parquet/WriteFiles/WriteUnshardedBundlesToTempFiles/GroupUnwritten/Write failed., Internal Issue (b1d798680dfd97cb): 63963027:24514

Exception in thread "main" java.lang.RuntimeException: Failed to create a workflow job: The size of the serialized JSON representation of the pipeline exceeds the allowable limit. For more information, please see the documentation on job submission:
https://cloud.google.com/dataflow/docs/guides/deploying-a-pipeline#jobs
        at org.apache.beam.runners.dataflow.DataflowRunner.run(DataflowRunner.java:1241)
        at org.apache.beam.runners.dataflow.DataflowRunner.run(DataflowRunner.java:197)
        at org.apache.beam.sdk.Pipeline.run(Pipeline.java:323)
        at org.apache.beam.sdk.Pipeline.run(Pipeline.java:309)
        at *masked*.App.run(App.java:233)
        at *masked*.App.main(App.java:240)
Caused by: com.google.api.client.googleapis.json.GoogleJsonResponseException: 400 Bad Request
POST https://dataflow.googleapis.com/*masked*
{
  "code" : 400,
  "errors" : [ {
    "domain" : "global",
    "message" : "(288bc232db1093b3): The job graph is too large. Please try again with a smaller job graph, or split your job into two or more smaller jobs.",
    "reason" : "badRequest"
  } ],
  "message" : "(288bc232db1093b3): The job graph is too large. Please try again with a smaller job graph, or split your job into two or more smaller jobs.",
  "status" : "INVALID_ARGUMENT"
}
        at com.google.api.client.googleapis.json.GoogleJsonResponseException.from(GoogleJsonResponseException.java:146)
        at com.google.api.client.googleapis.services.json.AbstractGoogleJsonClientRequest.newExceptionOnError(AbstractGoogleJsonClientRequest.java:118)
        at com.google.api.client.googleapis.services.json.AbstractGoogleJsonClientRequest.newExceptionOnError(AbstractGoogleJsonClientRequest.java:37)
        at com.google.api.client.googleapis.services.AbstractGoogleClientRequest$1.interceptResponse(AbstractGoogleClientRequest.java:428)
        at com.google.api.client.http.HttpRequest.execute(HttpRequest.java:1111)
        at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:514)
        at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:455)
        at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.execute(AbstractGoogleClientRequest.java:565)
        at org.apache.beam.runners.dataflow.DataflowClient.createJob(DataflowClient.java:64)
        at org.apache.beam.runners.dataflow.DataflowRunner.run(DataflowRunner.java:1227)
        ... 5 more

Because of this runtime error, I am still keeping beam.version at 2.19. Does anyone have any idea how to fix this, please?
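
For reference, based on the step names in the logs ("Read from HBase", "Map Result to Record", "Save To Parquet"), the pipeline is shaped roughly like the sketch below. The Avro schema, the row mapping and the HBase connection settings are simplified placeholders rather than the real code, and the --startDate/--endDate range scan is omitted:

import java.nio.charset.StandardCharsets;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.coders.AvroCoder;
import org.apache.beam.sdk.io.FileIO;
import org.apache.beam.sdk.io.hbase.HBaseIO;
import org.apache.beam.sdk.io.parquet.ParquetIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.TypeDescriptor;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;

// Simplified sketch of the pipeline; schema and row mapping are placeholders.
public class App {

  // Placeholder schema: the real record has many more fields.
  private static final Schema SCHEMA = new Schema.Parser().parse(
      "{\"type\":\"record\",\"name\":\"Row\",\"fields\":[{\"name\":\"rowKey\",\"type\":\"string\"}]}");

  public static void main(String[] args) {
    ExportOptions options =
        PipelineOptionsFactory.fromArgs(args).withValidation().as(ExportOptions.class);

    // HBase connection settings (zookeeper quorum etc.) and the start/end date scan are omitted here.
    Configuration hbaseConf = HBaseConfiguration.create();

    Pipeline pipeline = Pipeline.create(options);
    pipeline
        .apply("Read from HBase",
            HBaseIO.read()
                .withConfiguration(hbaseConf)
                .withTableId(options.getTableName()))
        .apply("Map Result to Record",
            MapElements.into(TypeDescriptor.of(GenericRecord.class))
                .via((Result result) -> toRecord(result)))
        .setCoder(AvroCoder.of(GenericRecord.class, SCHEMA))
        .apply("Save To Parquet",
            FileIO.<GenericRecord>write()
                .via(ParquetIO.sink(SCHEMA))
                .to(options.getOutputFolder())
                .withSuffix(".parquet"));
    pipeline.run();
  }

  // Placeholder mapping from one HBase row to an Avro record.
  private static GenericRecord toRecord(Result result) {
    GenericRecord record = new GenericData.Record(SCHEMA);
    record.put("rowKey", new String(result.getRow(), StandardCharsets.UTF_8));
    return record;
  }
}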

Apparently, the above issue was solved by just adding the flag below; no code was tuned or modified. I have no other words to say...

--experiments=use_runner_v2

Reference link: https://cloud.google.com/dataflow/docs/guides/deploying-a-pipeline#dataflow-runner-v2
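
If you prefer enabling it in code rather than on the command line, the same experiment can be added to the pipeline options before running the pipeline. A minimal sketch, assuming the options are built with PipelineOptionsFactory (the class and method names here are only for illustration):

import org.apache.beam.sdk.options.ExperimentalOptions;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class RunnerV2Example {

  public static PipelineOptions withRunnerV2(String[] args) {
    PipelineOptions options = PipelineOptionsFactory.fromArgs(args).withValidation().create();
    // Same effect as passing --experiments=use_runner_v2; adds the experiment if it is not already set.
    ExperimentalOptions.addExperiment(options.as(ExperimentalOptions.class), "use_runner_v2");
    return options;
  }
}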
