Apache Beam SIGKILL

Question

The Question

How do I best execute memory-intensive pipelines in Apache Beam?

Background

I've written a pipeline that takes the Naemura Bird dataset and converts the images and annotations to TFRecords containing TF Examples in the format required by the TF Object Detection API.

I tested the pipeline using DirectRunner with a small subset of images (4 or 5) and it worked fine.

The Problem

When I run the pipeline on a bigger data set (day 1 of 3, ~21 GB), it crashes after a while with a non-descriptive SIGKILL. I do see a memory peak before the crash and assume that the process is killed because the memory load is too high.

I ran the pipeline through strace. These are the last lines in the trace:

[pid 53702] 10:00:09.105069 poll([{fd=10, events=POLLIN}, {fd=11, events=POLLIN}, {fd=12, events=POLLIN}, {fd=13, events=POLLIN}, {fd=14, events=POLLIN}, {fd=15, events=POLLIN}, {fd=16, events=POLLIN}, {fd=17, events=POLLIN}, {fd=18, events=POLLIN}, {fd=19, events=POLLIN}, {fd=20, events=POLLIN}], 11, 100) = 0 (Timeout)
[pid 53702] 10:00:09.205826 poll([{fd=10, events=POLLIN}, {fd=11, events=POLLIN}, {fd=12, events=POLLIN}, {fd=13, events=POLLIN}, {fd=14, events=POLLIN}, {fd=15, events=POLLIN}, {fd=16, events=POLLIN}, {fd=17, events=POLLIN}, {fd=18, events=POLLIN}, {fd=19, events=POLLIN}, {fd=20, events=POLLIN}], 11, 100 <unfinished ...>
[pid 53534] 10:00:09.259806 mmap(NULL, 63082496, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f3aa43d7000
[pid 53694] 10:00:09.297140 <... clock_nanosleep resumed>NULL) = 0
[pid 53694] 10:00:09.297273 clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=0, tv_nsec=200000000},  <unfinished ...>
[pid 53702] 10:00:09.306409 <... poll resumed>) = 0 (Timeout)
[pid 53702] 10:00:09.306478 poll([{fd=10, events=POLLIN}, {fd=11, events=POLLIN}, {fd=12, events=POLLIN}, {fd=13, events=POLLIN}, {fd=14, events=POLLIN}, {fd=15, events=POLLIN}, {fd=16, events=POLLIN}, {fd=17, events=POLLIN}, {fd=18, events=POLLIN}, {fd=19, events=POLLIN}, {fd=20, events=POLLIN}], 11, 100) = 0 (Timeout)
[pid 53702] 10:00:09.406866 poll([{fd=10, events=POLLIN}, {fd=11, events=POLLIN}, {fd=12, events=POLLIN}, {fd=13, events=POLLIN}, {fd=14, events=POLLIN}, {fd=15, events=POLLIN}, {fd=16, events=POLLIN}, {fd=17, events=POLLIN}, {fd=18, events=POLLIN}, {fd=19, events=POLLIN}, {fd=20, events=POLLIN}], 11, 100 <unfinished ...>
[pid 53710] 10:03:55.844910 <... futex resumed>) = ?
[pid 53709] 10:03:57.797618 <... futex resumed>) = ?
[pid 53708] 10:03:57.797737 <... futex resumed>) = ?
[pid 53707] 10:03:57.797793 <... futex resumed>) = ?
[pid 53706] 10:03:57.797847 <... futex resumed>) = ?
[pid 53705] 10:03:57.797896 <... futex resumed>) = ?
[pid 53704] 10:03:57.797983 <... futex resumed>) = ?
[pid 53703] 10:03:57.798035 <... futex resumed>) = ?
[pid 53702] 10:03:57.798085 +++ killed by SIGKILL +++
[pid 53701] 10:03:57.798124 <... futex resumed>) = ?
[pid 53700] 10:03:57.798173 <... futex resumed>) = ?
[pid 53699] 10:03:57.798224 <... futex resumed>) = ?
[pid 53698] 10:03:57.798272 <... futex resumed>) = ?
[pid 53697] 10:03:57.798321 <... accept4 resumed> <unfinished ...>) = ?
[pid 53694] 10:03:57.798372 <... clock_nanosleep resumed> <unfinished ...>) = ?
[pid 53693] 10:03:57.798426 <... futex resumed>) = ?
[pid 53660] 10:03:57.798475 <... futex resumed>) = ?
[pid 53641] 10:03:57.798523 <... futex resumed>) = ?
[pid 53640] 10:03:57.798572 <... futex resumed>) = ?
[pid 53639] 10:03:57.798620 <... futex resumed>) = ?
[pid 53710] 10:03:57.798755 +++ killed by SIGKILL +++
[pid 53709] 10:03:57.798792 +++ killed by SIGKILL +++
[pid 53708] 10:03:57.798828 +++ killed by SIGKILL +++
[pid 53707] 10:03:57.798864 +++ killed by SIGKILL +++
[pid 53706] 10:03:57.798900 +++ killed by SIGKILL +++
[pid 53705] 10:03:57.798937 +++ killed by SIGKILL +++
[pid 53704] 10:03:57.798973 +++ killed by SIGKILL +++
[pid 53703] 10:03:57.799008 +++ killed by SIGKILL +++
[pid 53701] 10:03:57.799044 +++ killed by SIGKILL +++
[pid 53700] 10:03:57.799079 +++ killed by SIGKILL +++
[pid 53699] 10:03:57.799116 +++ killed by SIGKILL +++
[pid 53698] 10:03:57.799152 +++ killed by SIGKILL +++
[pid 53697] 10:03:57.799187 +++ killed by SIGKILL +++
[pid 53694] 10:03:57.799245 +++ killed by SIGKILL +++
[pid 53693] 10:03:57.799282 +++ killed by SIGKILL +++
[pid 53660] 10:03:57.799318 +++ killed by SIGKILL +++
[pid 53641] 10:03:57.799354 +++ killed by SIGKILL +++
[pid 53640] 10:03:57.799390 +++ killed by SIGKILL +++
[pid 53639] 10:03:57.910349 +++ killed by SIGKILL +++
10:03:57.910381 +++ killed by SIGKILL +++

Answer 1

Several things could cause this behaviour. Because the pipeline runs fine with less data, analysing what changed between the two runs could lead us to a resolution.
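Before changing anything, it may help to confirm that memory really is what changed. The following is a minimal sketch, not part of the original pipeline, that reports the peak resident set size of the current process using only the standard library. The helper name peak_rss_mib is made up for illustration. Note that it only measures the process it is called in, so with a multi-process runner it would have to be called from inside the worker code.

```python
import resource
import sys


def peak_rss_mib() -> float:
    """Return the peak resident set size of this process in MiB."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in KiB on Linux but in bytes on macOS.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / divisor


if __name__ == "__main__":
    before = peak_rss_mib()
    # Stand-in for real work such as decoding a batch of images.
    buffer = bytearray(50 * 1024 * 1024)
    after = peak_rss_mib()
    print(f"peak RSS before: {before:.0f} MiB, after: {after:.0f} MiB")
```

Logging this value every few hundred elements would show whether memory grows steadily with the number of processed files or jumps on particular inputs.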

Option 1: clean your input data

The third line of the logs you provided might indicate that the bigger run is processing unclean data: mmap(NULL, could mean that | "Get Content" >> beam.Map(lambda x: x.read_utf8()) is trying to read a null value. That said, a NULL first argument to mmap is also what an ordinary anonymous memory allocation looks like, so treat this as a hint rather than proof.
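To see what that trace line actually records, here is a small sketch that decodes it. The first argument of mmap is an address hint (NULL lets the kernel pick the address), the second is the length in bytes, and MAP_ANONYMOUS means the mapping is not backed by a file. The function name describe_mmap and the regular expression are my own and only cover lines shaped like the one in the question.

```python
import re

# The mmap line copied from the trace in the question.
TRACE_LINE = (
    "[pid 53534] 10:00:09.259806 mmap(NULL, 63082496, PROT_READ|PROT_WRITE, "
    "MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f3aa43d7000"
)

# Captures the address hint, length, protection bits and flags of an mmap call.
MMAP_RE = re.compile(
    r"mmap\((?P<addr>\w+), (?P<length>\d+), (?P<prot>[^,]+), (?P<flags>[^,]+),"
)


def describe_mmap(line: str) -> dict:
    """Extract the address hint, size in MiB and anonymity of an strace mmap line."""
    match = MMAP_RE.search(line)
    if match is None:
        raise ValueError("not an mmap line")
    flags = match["flags"].split("|")
    return {
        "address_hint": match["addr"],
        "size_mib": int(match["length"]) / (1024 * 1024),
        "anonymous": "MAP_ANONYMOUS" in flags,
    }


if __name__ == "__main__":
    print(describe_mmap(TRACE_LINE))
```

For the line above this reports an anonymous mapping of roughly 60 MiB with a NULL address hint, which is consistent with the process asking for more memory shortly before it was killed.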

Is there an empty file somewhere? Are your files UTF-8 encoded?
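Both questions can be checked outside the pipeline. This is a rough sketch using only the standard library. The function name find_suspect_files and the default "*.txt" pattern are assumptions, so adjust the pattern to whatever extension your annotation files use.

```python
from pathlib import Path


def find_suspect_files(root: str, pattern: str = "*.txt") -> dict:
    """Map each empty or non-UTF-8 file under root to the reason it was flagged."""
    suspects = {}
    for path in sorted(Path(root).rglob(pattern)):
        if not path.is_file():
            continue
        if path.stat().st_size == 0:
            suspects[str(path)] = "empty"
            continue
        try:
            # Decoding the raw bytes fails on the first invalid UTF-8 sequence.
            path.read_bytes().decode("utf-8")
        except UnicodeDecodeError as err:
            suspects[str(path)] = f"not UTF-8 (byte offset {err.start})"
    return suspects
```

An empty result means every matched file is non-empty and decodes as UTF-8, in which case the input data is probably not the cause.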

Option 2: use smaller files as input

I'm guessing that reading with fileio.ReadMatches() tries to load each whole file into memory; if a file is bigger than your available memory, this could lead to errors. Can you split your data into smaller files?
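If splitting the files is not practical, reading them piece by piece is another way to cap memory per file. As far as I know, ReadMatches only emits ReadableFile handles, and it is the later read() or read_utf8() call that pulls the full content into memory, while ReadableFile.open() gives you a file-like object instead. The sketch below is a generic chunked reader for any binary file-like object. The name iter_chunks is made up, and the Beam usage in the comment is an untested assumption.

```python
import io
from typing import BinaryIO, Iterator


def iter_chunks(stream: BinaryIO, chunk_size: int = 1024 * 1024) -> Iterator[bytes]:
    """Yield successive chunks of at most chunk_size bytes until the stream ends."""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        yield chunk


# Possible use inside a Beam pipeline (assumption, not verified here):
#   ... | fileio.ReadMatches()
#       | beam.FlatMap(lambda readable: iter_chunks(readable.open()))

if __name__ == "__main__":
    for piece in iter_chunks(io.BytesIO(b"abcdefghij"), chunk_size=4):
        print(piece)
```

This only helps for data that can be processed incrementally. A single image still has to fit in memory to be decoded, so for images the number of files handled in parallel matters more than the chunk size.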

Option 3: use a bigger infrastructure

If the files are too big for your current machine with the DirectRunner, you could try an on-demand infrastructure by using another runner in the cloud, such as DataflowRunner.
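Switching runners is mostly a matter of pipeline options. As a sketch, the helper below builds the command-line flags that a Dataflow run typically needs. The function name dataflow_args is hypothetical, and the project, region, bucket and machine type values are placeholders that you would replace with your own.

```python
from typing import List, Optional


def dataflow_args(
    project: str,
    region: str,
    temp_location: str,
    machine_type: Optional[str] = None,
) -> List[str]:
    """Build the pipeline flags for running on DataflowRunner."""
    args = [
        "--runner=DataflowRunner",
        f"--project={project}",
        f"--region={region}",
        f"--temp_location={temp_location}",
    ]
    if machine_type:
        # A high-memory worker type gives each worker more RAM per core.
        args.append(f"--machine_type={machine_type}")
    return args


if __name__ == "__main__":
    # All values below are placeholders, not real resources.
    flags = dataflow_args(
        project="my-project",
        region="us-central1",
        temp_location="gs://my-bucket/tmp",
        machine_type="n1-highmem-4",
    )
    print(flags)
    # With Beam installed, the list can be passed on as:
    #   from apache_beam.options.pipeline_options import PipelineOptions
    #   options = PipelineOptions(flags)
```

The input and output paths would also need to point at storage the workers can reach, such as a Cloud Storage bucket, rather than the local disk.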

Answer 1 (accepted), 2021-06-11 14:34:30
