python subprocess module hangs for spark-submit command when writing STDOUT

I have a python script that is used to submit spark jobs using the spark-submit tool. I want to execute the command and write the output both to STDOUT and a logfile in real time. I'm using Python 2.7 on an Ubuntu server.

This is what I have so far in my SubmitJob.py script:

#!/usr/bin/python
import subprocess

# Submit the command and stream its output to the screen and to a log file
def submitJob(cmd, log_file):
    with open(log_file, 'w') as fh:
        process = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
        while True:
            # read the merged stdout/stderr stream one line at a time
            output = process.stdout.readline()
            if output == '' and process.poll() is not None:
                break
            if output:
                print output.strip()
                fh.write(output)
        rc = process.poll()
        return rc

if __name__ == "__main__":
    cmdList = ["dse", "spark-submit", "--spark-master", "spark://127.0.0.1:7077", "--class", "com.spark.myapp", "./myapp.jar"]
    log_file = "/tmp/out.log"
    exit_status = submitJob(cmdList, log_file)
    print "job finished with status ", exit_status

The strange thing is, when I execute the same command directly in the shell it works fine and produces output on screen as the program proceeds.

So it looks like something is wrong with the way I'm using subprocess.PIPE for stdout and writing the file.

What's the currently recommended way to use the subprocess module to write to stdout and a log file in real time, line by line? I see a bunch of options on the internet but I'm not sure which is correct or up to date.

Thanks

Figured out what the problem was. I was trying to redirect both stdout and stderr to the pipe to display them on screen. This seems to block stdout when stderr output is present. If I remove the stderr=subprocess.STDOUT argument from Popen, it works fine. So for spark-submit it looks like you don't need to redirect stderr explicitly, as it already does this implicitly.
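
Here is a minimal sketch of what the submitJob function from the question might look like with that argument removed; this is just an illustration of the fix described above, using the same names as the question, not a tested drop-in replacement:

#!/usr/bin/python
import subprocess

# stderr is not redirected, so it goes straight to the terminal;
# only stdout is captured, echoed and written to the log file
def submitJob(cmd, log_file):
    with open(log_file, 'w') as fh:
        process = subprocess.Popen(cmd, stdout=subprocess.PIPE)
        while True:
            output = process.stdout.readline()
            if output == '' and process.poll() is not None:
                break
            if output:
                print output.strip()
                fh.write(output)
        return process.poll()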

To print the Spark log, one can call the cmdList given by user330612:

  cmdList = ["spark-submit", "--spark-master", "spark://127.0.0.1:7077", "--class", "com.spark.myapp", "./myapp.jar"]

Then it can be printed using subprocess. Remember to use communicate() to prevent deadlocks (https://docs.python.org/2/library/subprocess.html): "Warning: This will deadlock when using stdout=PIPE and/or stderr=PIPE and the child process generates enough output to a pipe such that it blocks waiting for the OS pipe buffer to accept more data. Use communicate() to avoid that." Below is the code to print the log.

import subprocess

# communicate() reads both streams to the end and waits for the process,
# so the OS pipe buffers can never fill up and block the child
p = subprocess.Popen(cmdList, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
stdout, stderr = p.communicate()
stderr = stderr.splitlines()
stdout = stdout.splitlines()
for line in stderr:
    print line  # now it can be printed line by line to a file or something else, for the log
for line in stdout:
    print line  # for the output
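
If the lines should also end up in a log file, as in the submitJob script above, a small follow-on sketch (reusing the /tmp/out.log path from the question) could look like this:

# write both streams to the log file as well as the screen
with open("/tmp/out.log", "w") as fh:
    for line in stderr + stdout:
        print line
        fh.write(line + "\n")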

More information about subprocess and printing lines can be found at: https://pymotw.com/2/subprocess/
