
How to get the worker logs in Spark using the default cluster manager?

I am trying to get the application output of a Spark run and cannot find a straightforward way of doing that.

Basically I am talking about the content of the <spark install dir>/work directory on the cluster worker.

I could've copied that directory to the location I need, but with 100500 nodes that simply doesn't scale.

The other option I was considering is to attach an exit function (like a trap in Bash) that collects the logs from each worker as part of the app run, roughly as in the sketch below. I just think there has to be a better solution than that.
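For the record, such an exit hook could look roughly like the sketch below. It is only a sketch under the assumption that the driver can SSH to every host listed in conf/slaves without a password and that SPARK_HOME is laid out the same way on every host; the destination path and the _collect_worker_logs helper are made up for the example:

import atexit
import os
import subprocess


def _collect_worker_logs(dest='/where/to/store/executor_logs'):
    '''Pull <SPARK_HOME>/work from every host in conf/slaves via rsync.'''
    spark_home = os.environ['SPARK_HOME']
    with open(os.path.join(spark_home, 'conf', 'slaves')) as f:
        hosts = [line.strip() for line in f
                 if line.strip() and not line.lstrip().startswith('#')]
    for host in hosts:
        target = os.path.join(dest, host)
        os.makedirs(target, exist_ok=True)
        # -a keeps timestamps/permissions, -z compresses over the wire.
        subprocess.check_call(
            ['rsync', '-az', '%s:%s/work/' % (host, spark_home), target])


# Run the collection when the driver process exits, similar to a bash trap.
atexit.register(_collect_worker_logs)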

Yeah, I know that I can use the YARN or Mesos cluster managers to get the logs, but it seems really weird to me that I cannot do such a convenient thing with the default (standalone) cluster manager.

Thanks a lot.

In the end I went for the following solution (Python):

import os
import tarfile
from io import BytesIO
from pyspark.sql import SparkSession


# Get the spark app.
spark = SparkSession.builder.appName("my-spark-app").getOrCreate()
# Get the executor working directories.
spark_home = os.environ.get('SPARK_HOME')
if spark_home:
    num_workers = 0
    # Count the hosts listed in conf/slaves, skipping blank and comment lines.
    with open(os.path.join(spark_home, 'conf', 'slaves'), 'r') as f:
        for line in f:
            if line.strip() and not line.lstrip().startswith('#'):
                num_workers += 1
    if num_workers:
        executor_logs_path = '/where/to/store/executor_logs'

        def _map(worker):
            '''Return a one-element list of (archive name, tar.gz bytes) for the
            log directory of the worker this task runs on.
            '''
            flo = BytesIO()
            with tarfile.open(fileobj=flo, mode="w:gz") as tar:
                tar.add(os.path.join(spark_home, 'work'), arcname='work')
            return [('worker_%d_dir.tar.gz' % worker, flo.getvalue()),]

        def _reduce(worker1, worker2):
            '''Merge the per-worker (name, tarball) lists into a single list.
            '''
            worker1.extend(worker2)
            return worker1

        # exist_ok avoids failing when the target directory already exists.
        os.makedirs(executor_logs_path, exist_ok=True)
        logs = spark.sparkContext.parallelize(range(num_workers), num_workers).map(_map).reduce(_reduce)
        with tarfile.open(os.path.join(executor_logs_path, 'logs.tar'), 'w') as tar:
            for name, data in logs:
                info = tarfile.TarInfo(name=name)
                info.size = len(data)
                tar.addfile(tarinfo=info, fileobj=BytesIO(data))
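For completeness, to inspect the collected logs afterwards, the outer logs.tar and the per-worker tar.gz archives inside it can be unpacked with something like the following (the paths reuse the placeholders from the snippet above):

import os
import tarfile

executor_logs_path = '/where/to/store/executor_logs'

# Extract the per-worker tar.gz archives from the outer logs.tar ...
with tarfile.open(os.path.join(executor_logs_path, 'logs.tar')) as outer:
    outer.extractall(executor_logs_path)

# ... then unpack each worker's archive into its own sub-directory.
for name in os.listdir(executor_logs_path):
    if name.endswith('.tar.gz'):
        with tarfile.open(os.path.join(executor_logs_path, name)) as inner:
            inner.extractall(
                os.path.join(executor_logs_path, name[:-len('.tar.gz')]))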

A couple of concerns though:

  • not sure if using the map-reduce technique is the best way to collect the logs (parallelize(range(num_workers), num_workers) does not strictly guarantee that exactly one task lands on each worker)
  • the files (tarballs) are built in memory, so depending on your application it can crash if they are too big (see the first sketch after this list)
  • perhaps there is a better way to determine the number of workers than counting the lines of conf/slaves (see the second sketch after this list)
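On the in-memory tarballs: one workaround is to have each task write its archive straight to a filesystem that is mounted on every worker (NFS, or a FUSE-mounted object store) instead of shipping the bytes back through the driver. This is only a sketch under that assumption; the /shared/executor_logs path and the _dump_worker_logs helper are hypothetical, and spark / num_workers are reused from the snippet above:

import os
import socket
import tarfile

# Assumption: this directory is a shared mount visible on every worker.
SHARED_LOGS_DIR = '/shared/executor_logs'


def _dump_worker_logs(_):
    '''Write this worker's <SPARK_HOME>/work as a tar.gz onto the shared mount
    and return the path, so the driver only ever sees short strings.
    '''
    # Assumption: SPARK_HOME is also set in the executor environment.
    spark_home = os.environ['SPARK_HOME']
    os.makedirs(SHARED_LOGS_DIR, exist_ok=True)
    out_path = os.path.join(
        SHARED_LOGS_DIR, '%s_work.tar.gz' % socket.gethostname())
    with tarfile.open(out_path, 'w:gz') as tar:
        tar.add(os.path.join(spark_home, 'work'), arcname='work')
    return out_path


archive_paths = spark.sparkContext.parallelize(
    range(num_workers), num_workers).map(_dump_worker_logs).collect()

As for the number of workers, a trick that is often suggested is to ask the JVM SparkContext for its executor memory status and subtract one for the driver. It goes through py4j internals (_jsc), so treat it as unsupported and liable to change between versions:

# Internal/unsupported: the returned map is keyed by "host:port" and
# includes the driver, hence the "- 1".
num_workers = spark.sparkContext._jsc.sc().getExecutorMemoryStatus().size() - 1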

