
Multithreaded Perl script leads to broken pipe if called as a Python subprocess

I am calling a Perl script from Python 3.7.3, with subprocess. The Perl script that is called is this one:

https://github.com/moses-smt/mosesdecoder/blob/master/scripts/tokenizer/tokenizer.perl

And the code I am using to call it is:

import sys
import os
import subprocess
import threading

def copy_out(source, dest):
    for line in source:
        dest.write(line)

num_threads=4

args = ["perl", "tokenizer.perl",
        "-l", "en",
        "-threads", str(num_threads)
       ]

with open(os.devnull, "wb") as devnull:
    tokenizer = subprocess.Popen(args,
        stdin=subprocess.PIPE, stdout=subprocess.PIPE, stderr=devnull)

tokenizer_thread = threading.Thread(target=copy_out, args=(tokenizer.stdout, open("outfile", "wb")))
tokenizer_thread.start()

num_lines = 100000

for _ in range(num_lines):
    tokenizer.stdin.write(b'Random line.\n')

tokenizer.stdin.close()
tokenizer_thread.join()

tokenizer.wait()

On my system, this leads to the following error:

Traceback (most recent call last):
  File "t.py", line 27, in <module>
    tokenizer.stdin.write(b'Random line.\n')
BrokenPipeError: [Errno 32] Broken pipe

I investigated this, and it turns out that if the -threads argument for the subprocess is 1, the error is not thrown. As I don't want to give up on multithreading in the child process, my question is:

What is causing this error in the first place? "Who" is to blame for it: OS / environment, my Python code, the Perl code?

I am glad to provide more information if needed.


EDIT : To respond to some comments,

  • Running the Perl script is only possible if you also have this file: https://github.com/moses-smt/mosesdecoder/blob/master/scripts/share/nonbreaking_prefixes/nonbreaking_prefix.en
  • The Perl script actually processes several thousands of lines before the process fails. In my Python script above, if I make num_lines smaller, I do not get this error anymore.
  • If I invoke this Perl script simply on the command line, without any Python, it works fine: no matter how many (Perl) threads or lines of input.
  • My Python variable num_threads only controls the number of threads of the Perl subprocess. I never start several Python threads, just one.

EDIT 2 : In my first edit, I incorrectly stated that this Perl program runs fine when called with e.g. -threads 4 from the command line: there, a different perl binary was used, one compiled with multithreading support. If I use the same perl that is invoked from Python, I get:

$ cat [file with 100000 lines] | [correct perl] tokenizer.perl -l en -threads 4
Can't locate object method "new" via package "Thread" at
tokenizer.perl line 130, <STDIN> line 8000.

Which no doubt would have helped me debug this better.

The problem seems to be that the Perl script crashes if the perl interpreter was not built with thread support. You can check whether your perl supports threads by running:

perl -MConfig -E 'say "Threads supported" if $Config{useithreads}'
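The same probe can be run from Python before launching the tokenizer, so a thread-incapable perl is detected up front instead of surfacing later as a broken pipe. A minimal sketch (the probe expression mirrors the one-liner above; the function name is my own):

```python
import shutil
import subprocess

def perl_supports_threads():
    """Return True if the perl on PATH was built with ithreads."""
    if shutil.which("perl") is None:
        return False  # no perl available at all
    result = subprocess.run(
        ["perl", "-MConfig", "-e", "print $Config{useithreads} ? 1 : 0"],
        capture_output=True, text=True)
    return result.stdout.strip() == "1"
```

Calling this once at startup lets the Python side fail with a clear message rather than a BrokenPipeError several thousand lines in.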

In my case, the output was empty, so I installed a new perl with thread support:

perlbrew install perl-5.30.0 --as=5.30.0-threads -Dusethreads
perlbrew use 5.30.0-threads

Then I ran the Python script again:

import sys
import os
import subprocess
import threading

def copy_out(source, dest):
    for line in iter(source.readline, b''):
        dest.write(line)

num_threads=4
args = ["perl", "tokenizer.perl",
        "-l", "en",
        "-threads", str(num_threads)
       ]
tokenizer = subprocess.Popen(
    args,
    bufsize=-1,  # the default: block-buffered, io.DEFAULT_BUFFER_SIZE (typically 8192 bytes)
    stdin=subprocess.PIPE,
    stdout=subprocess.PIPE,
    stderr=subprocess.DEVNULL)

tokenizer_thread = threading.Thread(
    target=copy_out, args=(tokenizer.stdout, open("outfile", "wb")))
tokenizer_thread.start()

num_lines = 100000

for _ in range(num_lines):
    tokenizer.stdin.write(b'Random line.\n')

tokenizer.stdin.close()
tokenizer_thread.join()
tokenizer.wait()

and it now ran to the end with no errors and produced the output file outfile with 100000 lines.

What is causing this error in the first place?

Writing to a closed pipe causes the OS to send SIGPIPE to the process calling write. This allows programs to work as generators. For example, the following won't run forever despite containing an infinite loop, because head will exit and close its STDIN after reading ten lines, leading to perl receiving a SIGPIPE.

perl -le'1 while print ++$i;' | head

If the SIGPIPE signal is being ignored, the write system call returns the error EPIPE (Broken pipe) instead. The following won't run forever either, because print fails with EPIPE once head exits.

perl -le'$SIG{PIPE}="IGNORE"; 1 while print ++$i;' | head

From the fact that your Python program received an EPIPE error, we deduce two facts:

  • The Python program ignores SIGPIPE signals, and
  • All handles to the reader end of the pipe were closed.
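This behaviour is easy to reproduce in pure Python: the interpreter sets SIGPIPE to ignored by default, so writing to a reader that has already exited raises BrokenPipeError (EPIPE) rather than silently killing the writer. A minimal sketch, using a short-lived Python child as a stand-in for the crashed tokenizer:

```python
import subprocess
import sys

# Child reads a single line from stdin and exits, like `head -n 1`.
reader = subprocess.Popen(
    [sys.executable, "-c", "import sys; sys.stdin.readline()"],
    stdin=subprocess.PIPE, stdout=subprocess.DEVNULL)

caught = False
try:
    # Keep writing after the reader has gone away; once the reader
    # end of the pipe is closed, write() fails with EPIPE.
    for _ in range(1_000_000):
        reader.stdin.write(b"x\n")
        reader.stdin.flush()
except BrokenPipeError:
    caught = True  # EPIPE surfaces here because Python ignores SIGPIPE

try:
    reader.stdin.close()
except BrokenPipeError:
    pass  # closing may flush and hit EPIPE again
reader.wait()
```

This is the same failure mode as the tokenizer example: the writer only learns that the child is gone when it next writes to the pipe.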

So we must ask ourselves: why would the Perl program close its STDIN? It's very unlikely that its STDIN was closed explicitly. By far the most likely explanation is that the child process was terminated.

"Who" is to blame for it: OS / environment, my Python code, the Perl code?

That depends on what caused the Perl program to exit. The first thing to do is figure out what exit status was returned by the child process. Depending on the exit status, we'll know whether

  • the process was killed by a signal,
  • the process exited with an error, or
  • the process completed successfully.

If the exit code tells us the process was killed by a signal, it will also tell us which signal. This could give us some information. (This would be the hardest of the three scenarios to debug.)

If the exit code tells us the process returned an error, the error code itself might not contain any additional useful information, but an error message was surely sent to the child's STDERR to provide more information.

If the exit code tells us the process completed successfully, perhaps the arguments or input you are providing don't mean what you think they mean.

So make sure to call tokenizer.wait() to collect the exit status, which is stored in tokenizer.returncode. Also make sure to log what is being sent to STDERR.
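That diagnosis step can be sketched as follows: capture stderr instead of discarding it, then decode the exit status (on POSIX, a negative Popen.returncode means the child was killed by that signal number). The failing child below is a hypothetical stand-in, not the real tokenizer:

```python
import subprocess
import sys

# Hypothetical stand-in for a failing child: it writes a message to
# stderr and exits with status 1.
child = subprocess.Popen(
    [sys.executable, "-c", "import sys; sys.exit('tokenizer blew up')"],
    stdin=subprocess.PIPE, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
out, err = child.communicate(b"Random line.\n")

if child.returncode < 0:
    print("killed by signal", -child.returncode)        # signal case
elif child.returncode > 0:
    print("exited with error", child.returncode,
          "stderr:", err.decode().strip())              # error case
else:
    print("completed successfully")                     # success case
```

In the original script, replacing stderr=devnull with stderr=subprocess.PIPE (or a log file) would have exposed the "Can't locate object method" message immediately.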
