Why did I encounter an "Error syncing pod" with my Dataflow pipeline?
I experience a weird error with my Dataflow pipeline when I want to use a specific library from PyPI. I need jsonschema in a ParDo, so in my requirements.txt file I added jsonschema==3.2.0. I launch my pipeline with the command line below:
python -m gcs_to_all \
--runner DataflowRunner \
--project <my-project-id> \
--region europe-west1 \
--temp_location gs://<my-bucket-name>/temp/ \
--input_topic "projects/<my-project-id>/topics/<my-topic>" \
--network=<my-network> \
--subnetwork=<my-subnet> \
--requirements_file=requirements.txt \
--experiments=allow_non_updatable_job \
--streaming
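For reference, the requirements.txt described above contains a single pinned line (this sketch assumes nothing else is in the file):

```
jsonschema==3.2.0
```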
In the terminal, everything seems to be fine:
INFO:root:2020-01-03T09:18:35.569Z: JOB_MESSAGE_BASIC: Worker configuration: n1-standard-4 in europe-west1-b.
INFO:root:2020-01-03T09:18:35.806Z: JOB_MESSAGE_WARNING: The network default doesn't have rules that open TCP ports 12345-12346 for internal connection with other VMs. Only rules with a target tag 'dataflow' or empty target tags set apply. If you don't specify such a rule, any pipeline with more than one worker that shuffles data will hang. Causes: Firewall rules associated with your network don't open TCP ports 12345-12346 for Dataflow instances. If a firewall rule opens connection in these ports, ensure target tags aren't specified, or that the rule includes the tag 'dataflow'.
INFO:root:2020-01-03T09:18:48.549Z: JOB_MESSAGE_DETAILED: Workers have started successfully.
There's no error in the log tab on the Dataflow web page, but in Stackdriver:
message: "Error syncing pod 6515c378c6bed37a2c0eec1fcfea300c ("<dataflow-id>--01030117-c9pc-harness-5lkv_default(6515c378c6bed37a2c0eec1fcfea300c)"), skipping: [failed to "StartContainer" for "sdk0" with CrashLoopBackOff: "Back-off 10s restarting failed container=sdk0 pod=<dataflow-id>--01030117-c9pc-harness-5lkv_default(6515c378c6bed37a2c0eec1fcfea300c)""
message: ", failed to "StartContainer" for "sdk1" with CrashLoopBackOff: "Back-off 5m0s restarting failed container=sdk1 pod=<dataflow-id>--01030117-c9pc-harness-5lkv_default(6515c378c6bed37a2c0eec1fcfea300c)""
message: ", failed to "StartContainer" for "sdk2" with CrashLoopBackOff: "Back-off 5m0s restarting failed container=sdk2 pod=<dataflow-id>--01030117-c9pc-harness-5lkv_default(6515c378c6bed37a2c0eec1fcfea300c)""
message: ", failed to "StartContainer" for "sdk3" with CrashLoopBackOff: "Back-off 5m0s restarting failed container=sdk3 pod=<dataflow-id>--01030117-c9pc-harness-5lkv_default(6515c378c6bed37a2c0eec1fcfea300c)""
I found this error too (at the INFO level):
Collecting jsonschema (from -r /var/opt/google/staged/requirements.txt (line 1))
Installing build dependencies: started
Looking in links: /var/opt/google/staged
Installing build dependencies: started
Collecting jsonschema (from -r /var/opt/google/staged/requirements.txt (line 1))
Installing build dependencies: started
Looking in links: /var/opt/google/staged
Collecting jsonschema (from -r /var/opt/google/staged/requirements.txt (line 1))
Installing build dependencies: started
Installing build dependencies: finished with status 'error'
ERROR: Command errored out with exit status 1:
command: /usr/local/bin/python3 /usr/local/lib/python3.7/site-packages/pip install --ignore-installed --no-user --prefix /tmp/pip-build-env-mdurhav9/overlay --no-warn-script-location --no-binary :none: --only-binary :none: --no-index --find-links /var/opt/google/staged -- 'setuptools>=40.6.0' wheel
cwd: None
Complete output (5 lines):
Looking in links: /var/opt/google/staged
Collecting setuptools>=40.6.0
Collecting wheel
ERROR: Could not find a version that satisfies the requirement wheel (from versions: none)
ERROR: No matching distribution found for wheel
But I don't know why it can't get this dependency...
Do you have any idea how I can debug this, or why I encounter this error?
Thanks
When Dataflow workers start, they execute several steps:

1. Install packages from requirements.txt
2. Install packages specified as extra_packages
3. Install the workflow tarball and execute actions provided in setup.py

The Error syncing pod with CrashLoopBackOff message can be related to a dependency conflict. You need to verify that there are no conflicts with the libraries and versions used for the job. Please refer to the documentation for staging required dependencies of the pipeline.
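One way to narrow this down is to reproduce the stage-then-install process locally (a sketch; the staging directory name is illustrative, and it assumes pip is available). The SDK stages source distributions only, and the worker then installs offline from that directory, which is why build-time dependencies such as wheel must also be resolvable from it:

```shell
# Assumption: requirements.txt holds the single pin from the question.
printf 'jsonschema==3.2.0\n' > requirements.txt

# Step 1: stage source distributions only (no prebuilt wheels).
python -m pip download -r requirements.txt --dest /tmp/staged --no-binary :all:

# Step 2: simulate the worker's offline install from the staged directory.
# If this step fails locally, it will also fail on the Dataflow workers.
python -m pip install --no-index --find-links /tmp/staged -r requirements.txt
```

If step 2 reproduces the "No matching distribution found for wheel" error, the problem is in the staged dependency set rather than in the worker environment.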
Also, take a look at the preinstalled dependencies and this StackOverflow thread.
What you can try is to change the version of jsonschema and run the pipeline again. If that doesn't help, please provide your requirements.txt file.

I hope this helps.
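One commonly reported workaround for this particular pip failure (an assumption on my part, not verified against your job) is to list the build-time dependencies explicitly in requirements.txt, so they are staged alongside the runtime packages and the offline install on the worker can resolve them:

```
setuptools>=40.6.0
wheel
jsonschema==3.2.0
```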
There is a playbook for this error: https://cloud.google.com/dataflow/docs/guides/common-errors#error-syncing-pod