
Remote debug an Apache Beam job on a Flink cluster

I am running a streaming Beam job on a Flink cluster, where I am getting the following exception:

Caused by: org.apache.beam.sdk.util.UserCodeException: org.apache.flink.streaming.runtime.tasks.ExceptionInChainedOperatorException: Could not forward element to next operator
        at org.apache.beam.sdk.util.UserCodeException.wrap(UserCodeException.java:34)
        at org.apache.beam.sdk.transforms.MapElements$1$DoFnInvoker.invokeProcessElement(Unknown Source)
        at org.apache.beam.runners.core.SimpleDoFnRunner.invokeProcessElement(SimpleDoFnRunner.java:218)
        at org.apache.beam.runners.core.SimpleDoFnRunner.processElement(SimpleDoFnRunner.java:183)
        at org.apache.beam.runners.flink.metrics.DoFnRunnerWithMetricsUpdate.processElement(DoFnRunnerWithMetricsUpdate.java:62)
        at org.apache.beam.runners.flink.translation.wrappers.streaming.DoFnOperator.processElement(DoFnOperator.java:544)
        at org.apache.flink.streaming.runtime.io.StreamInputProcessor.processInput(StreamInputProcessor.java:202)
        at org.apache.flink.streaming.runtime.tasks.OneInputStreamTask.run(OneInputStreamTask.java:105)
        at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:302)
        at org.apache.flink.runtime.taskmanager.Task.run(Task.java:711)
        at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.flink.streaming.runtime.tasks.ExceptionInChainedOperatorException: Could not forward element to next operator
        at org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.pushToOperator(OperatorChain.java:596)
        at org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect(OperatorChain.java:554)
        at org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect(OperatorChain.java:534)
        at org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.collect(AbstractStreamOperator.java:718)
        at org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.collect(AbstractStreamOperator.java:696)
        at org.apache.beam.runners.flink.translation.wrappers.streaming.DoFnOperator$BufferedOutputManager.emit(DoFnOperator.java:941)
        at org.apache.beam.runners.flink.translation.wrappers.streaming.DoFnOperator$BufferedOutputManager.output(DoFnOperator.java:895)
        at org.apache.beam.runners.core.SimpleDoFnRunner.outputWindowedValue(SimpleDoFnRunner.java:252)
        at org.apache.beam.runners.core.SimpleDoFnRunner.access$700(SimpleDoFnRunner.java:74)
        at org.apache.beam.runners.core.SimpleDoFnRunner$DoFnProcessContext.output(SimpleDoFnRunner.java:576)
        at org.apache.beam.sdk.transforms.DoFnOutputReceivers$WindowedContextOutputReceiver.output(DoFnOutputReceivers.java:71)
        at org.apache.beam.sdk.transforms.MapElements$1.processElement(MapElements.java:139)
Caused by: org.apache.beam.sdk.util.UserCodeException: java.lang.IllegalArgumentException: Expect srcResourceIds and destResourceIds have the same scheme, but received alluxio, file.
        at org.apache.beam.sdk.util.UserCodeException.wrap(UserCodeException.java:34)
        at org.apache.beam.sdk.io.WriteFiles$FinalizeTempFileBundles$FinalizeFn$DoFnInvoker.invokeProcessElement(Unknown Source)
        at org.apache.beam.runners.core.SimpleDoFnRunner.invokeProcessElement(SimpleDoFnRunner.java:218)
        at org.apache.beam.runners.core.SimpleDoFnRunner.processElement(SimpleDoFnRunner.java:183)
        at org.apache.beam.runners.flink.metrics.DoFnRunnerWithMetricsUpdate.processElement(DoFnRunnerWithMetricsUpdate.java:62)
        at org.apache.beam.runners.flink.translation.wrappers.streaming.DoFnOperator.processElement(DoFnOperator.java:544)
        at org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.pushToOperator(OperatorChain.java:579)
        at org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect(OperatorChain.java:554)
        at org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect(OperatorChain.java:534)
        at org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.collect(AbstractStreamOperator.java:718)
        at org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.collect(AbstractStreamOperator.java:696)
        at org.apache.beam.runners.flink.translation.wrappers.streaming.DoFnOperator$BufferedOutputManager.emit(DoFnOperator.java:941)
        at org.apache.beam.runners.flink.translation.wrappers.streaming.DoFnOperator$BufferedOutputManager.output(DoFnOperator.java:895)
        at org.apache.beam.runners.core.SimpleDoFnRunner.outputWindowedValue(SimpleDoFnRunner.java:252)
        at org.apache.beam.runners.core.SimpleDoFnRunner.access$700(SimpleDoFnRunner.java:74)
        at org.apache.beam.runners.core.SimpleDoFnRunner$DoFnProcessContext.output(SimpleDoFnRunner.java:576)
        at org.apache.beam.sdk.transforms.DoFnOutputReceivers$WindowedContextOutputReceiver.output(DoFnOutputReceivers.java:71)
        at org.apache.beam.sdk.transforms.MapElements$1.processElement(MapElements.java:139)
        at org.apache.beam.sdk.transforms.MapElements$1$DoFnInvoker.invokeProcessElement(Unknown Source)
        at org.apache.beam.runners.core.SimpleDoFnRunner.invokeProcessElement(SimpleDoFnRunner.java:218)
        at org.apache.beam.runners.core.SimpleDoFnRunner.processElement(SimpleDoFnRunner.java:183)
        at org.apache.beam.runners.flink.metrics.DoFnRunnerWithMetricsUpdate.processElement(DoFnRunnerWithMetricsUpdate.java:62)
        at org.apache.beam.runners.flink.translation.wrappers.streaming.DoFnOperator.processElement(DoFnOperator.java:544)
        at org.apache.flink.streaming.runtime.io.StreamInputProcessor.processInput(StreamInputProcessor.java:202)
        at org.apache.flink.streaming.runtime.tasks.OneInputStreamTask.run(OneInputStreamTask.java:105)
        at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:302)
        at org.apache.flink.runtime.taskmanager.Task.run(Task.java:711)
        at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.IllegalArgumentException: Expect srcResourceIds and destResourceIds have the same scheme, but received alluxio, file.
        at org.apache.beam.vendor.guava.v26_0_jre.com.google.common.base.Preconditions.checkArgument(Preconditions.java:141)
        at org.apache.beam.sdk.io.FileSystems.validateSrcDestLists(FileSystems.java:428)
        at org.apache.beam.sdk.io.FileSystems.rename(FileSystems.java:308)
        at org.apache.beam.sdk.io.FileBasedSink$WriteOperation.moveToOutputFiles(FileBasedSink.java:755)
        at org.apache.beam.sdk.io.WriteFiles$FinalizeTempFileBundles$FinalizeFn.process(WriteFiles.java:850)

The streaming job reads data from an Apache Pulsar source and writes the output to an Alluxio data lake in Parquet format. I am using Spotify's Scio to write the job in Scala. A small code chunk to show what I am trying to achieve:

    pulsarSource
      .open(sc)                                                    // read from Pulsar into an SCollection
      .withFixedWindows(Duration.standardSeconds(windowDuration))  // fixed windows of windowDuration seconds
      .toSinkTap(sink)                                             // write each window to the Parquet sink on Alluxio

From the exception I can see that the rename's srcResourceIds and destResourceIds should have the same URI scheme, but I don't know how they end up differing, because I am using an alluxio path as the output directory. Some temp directories do get created under the alluxio output directory, but after the window duration elapses and the final output file is being created, this exception occurs. I suspected that the temp location might default to the local filesystem, so I set it to the output directory path (the alluxio dir path), but it didn't change anything:

sc.options.setTempLocation(outputDir)
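
Since the rename in the stack trace fails because its source and destination resolve to different schemes, one quick check is to print which scheme Beam's FileSystems registry actually resolves for both paths. A minimal sketch, reusing the outputDir value from above; it assumes an Alluxio filesystem registrar is on the job's classpath:

    import org.apache.beam.sdk.io.FileSystems

    // Registers every FileSystem implementation found on the classpath.
    FileSystems.setDefaultPipelineOptions(sc.options)

    // Both should print "alluxio"; "file" means that path silently resolved to
    // the local filesystem. matchNewResource throws IllegalArgumentException
    // if no filesystem is registered for the scheme at all.
    val dest = FileSystems.matchNewResource(outputDir, true)
    val tmp  = FileSystems.matchNewResource(sc.options.getTempLocation, true)
    println(s"dest scheme: ${dest.getScheme}, temp scheme: ${tmp.getScheme}")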

I want to do remote debugging in order to figure out the issue. I have followed this document to do remote debugging on the task executor node, but once my IntelliJ IDE connects to the node, my breakpoints are never hit.
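
For context, the usual way to expose a debug port on a task manager is to add the standard JDWP agent flags to the task manager JVM options in flink-conf.yaml and attach an IntelliJ "Remote JVM Debug" configuration to that port (5005 here is just an example):

    # flink-conf.yaml: open a JDWP debug port on each task manager JVM
    env.java.opts.taskmanager: "-agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=5005"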

Can someone suggest how I can debug this or get more information about the issue? Thanks.

Remote debugging might be quite hard, but let's try this first: make sure you connect to the task manager and not the job manager (easy to verify with thread names). Then make sure to have a high number of retries, so that you don't miss the task execution, as attaching the debugger might take a while.
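
A high retry count can be configured through Flink's restart strategy, for example in flink-conf.yaml (the numbers below are arbitrary; pick whatever gives you enough attempts):

    # flink-conf.yaml: keep restarting the failing job so the task runs again
    restart-strategy: fixed-delay
    restart-strategy.fixed-delay.attempts: 1000
    restart-strategy.fixed-delay.delay: 10 s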

It's also helpful to double-check that the line numbers in the stack trace match the code version in your IDE. If Flink/Beam is preinstalled on the cluster, they might run a slightly different version, and your breakpoint would then be void. Just paste the stack trace into your IDE and check that each line matches the expectation. Finally, add a few more breakpoints at central places, such as org.apache.flink.streaming.runtime.io.StreamInputProcessor.processInput(StreamInputProcessor.java:202), to verify that the setup works at all.

However, remote debugging is usually not the recommended option for big data systems. You'd first ensure locally that most things work on their own, with some integration tests and local runners. Then you might want to add e2e tests with Docker containers and a local mini cluster. Additionally, you'd add plenty of logging statements, which you can turn on and off through your logging configuration. Similarly, if you set the logging level to debug, the frameworks' existing log statements might already be enough to gain some insight. One important thing you should always look at is the generated topology in the Web UI; it may already tell you the paths in question.
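
For the logging route, a sketch of what turning on debug output for Beam's file sink internals could look like in the task managers' log4j.properties (assuming the log4j 1 syntax that older Flink distributions ship with; the logger name targets the package from the stack trace above):

    # log4j.properties: surface the temp-file rename/finalize steps of the sink
    log4j.logger.org.apache.beam.sdk.io=DEBUG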
