简体   繁体   中英

python subprocess module hangs for spark-submit command when writing STDOUT

I have a python script that is used to submit spark jobs using the spark-submit tool. I want to execute the command and write the output both to STDOUT and a logfile in real time. i'm using python 2.7 on a ubuntu server.

This is what I have so far in my SubmitJob.py script

#!/usr/bin/python

# Submit the command
def submitJob(cmd, log_file):
    with open(log_file, 'w') as fh:
        process = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
        while True:
            output = process.stdout.readline()
            if output == '' and process.poll() is not None:
                break
            if output:
                print output.strip()
                fh.write(output)
        rc = process.poll()
        return rc

if __name__ == "__main__":
    cmdList = ["dse", "spark-submit", "--spark-master", "spark://127.0.0.1:7077", "--class", "com.spark.myapp", "./myapp.jar"]
    log_file = "/tmp/out.log"
    exist_status = submitJob(cmdList, log_file)
    print "job finished with status ",exist_status

The strange thing is, when I execute the same command direcly in the shell it works fine and produces output on screen as the proggram proceeds.

So it looks like something is wrong in the way I'm using the subprocess.PIPE for stdout and writing the file.

What's the current recommended way to use subprocess module for writing to stdout and log file in real time line by line? I see bunch of options on the internet but not sure which is correct or latest.

thanks

Figured out what the problem was. I was trying to redirect both stdout n stderr to pipe to display on screen. This seems to block the stdout when stderr is present. If I remove the stderr=stdout argument from Popen, it works fine. So for spark-submit it looks like you don't need to redirect stderr explicitly as it already does this implicitly

To print the Spark log One can call the commandList given by user330612

  cmdList = ["spark-submit", "--spark-master", "spark://127.0.0.1:7077", "--class", "com.spark.myapp", "./myapp.jar"]

Then it can be printed by using subprocess, remember to use communicate() to prevent deadlocks https://docs.python.org/2/library/subprocess.html Warning Deadlock when using stdout=PIPE and/or stderr=PIPE and the child process generates enough output to a pipe such that it blocks waiting for the OS pipe buffer to accept more data. Use communicate() to avoid that. Here below is the code to print the log.

import subprocess
p = subprocess.Popen(cmdList,stdout=subprocess.PIPE,stdout=subprocess.PIPE,stderr=subprocess.PIPE)
stdout, stderr = p.communicate() 
stderr=stderr.splitlines()
stdout=stdout.splitlines()
for line in stderr:
    print line  #now it can be printed line by line to a file or something else, for the log
for line in stdout:
    print line #for the output 

More information about subprocess and printing lines can be found at: https://pymotw.com/2/subprocess/

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM