
How to get the worker logs in Spark using the default cluster manager?

I am trying to get the application output of a Spark run and cannot find a straightforward way of doing that.

Basically I am talking about the content of the <spark install dir>/work directory on the cluster worker.

I could've copied that directory to the location I need, but with 100500 nodes that simply doesn't scale.

The other option I was considering is to attach an exit function (like a trap in Bash) that collects the logs from each worker as part of the app run, roughly as in the sketch below. I just think there has to be a better solution than that.
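For the record, such an exit hook could look roughly like the sketch below. It is only a sketch under the assumption that the driver can SSH to every host listed in conf/slaves without a password and that SPARK_HOME is laid out the same way on every host; the destination path and the _collect_worker_logs helper are made up for the example:

import atexit
import os
import subprocess


def _collect_worker_logs(dest='/where/to/store/executor_logs'):
    '''Pull <SPARK_HOME>/work from every host in conf/slaves via rsync.'''
    spark_home = os.environ['SPARK_HOME']
    with open(os.path.join(spark_home, 'conf', 'slaves')) as f:
        hosts = [line.strip() for line in f
                 if line.strip() and not line.lstrip().startswith('#')]
    for host in hosts:
        target = os.path.join(dest, host)
        os.makedirs(target, exist_ok=True)
        # -a keeps timestamps/permissions, -z compresses over the wire.
        subprocess.check_call(
            ['rsync', '-az', '%s:%s/work/' % (host, spark_home), target])


# Run the collection when the driver process exits, similar to a bash trap.
atexit.register(_collect_worker_logs)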

Yeah, I know that I can use the YARN or Mesos cluster managers to get the logs, but it seems really weird to me that I cannot do such a convenient thing with the default (standalone) cluster manager.

Thanks a lot.

In the end I went for the following solution (Python):

import os
import tarfile
from io import BytesIO
from pyspark.sql import SparkSession


# Get the spark app.
spark = SparkSession.builder.appName("my-spark-app").getOrCreate()
# Get the executor working directories.
spark_home = os.environ.get('SPARK_HOME')
if spark_home:
    num_workers = 0
    # Count the hosts listed in conf/slaves, skipping blank and comment lines.
    with open(os.path.join(spark_home, 'conf', 'slaves'), 'r') as f:
        for line in f:
            if line.strip() and not line.lstrip().startswith('#'):
                num_workers += 1
    if num_workers:
        executor_logs_path = '/where/to/store/executor_logs'

        def _map(worker):
            '''Return a one-element list of (archive name, tar.gz bytes) for the
            log directory of the worker this task runs on.
            '''
            flo = BytesIO()
            with tarfile.open(fileobj=flo, mode="w:gz") as tar:
                tar.add(os.path.join(spark_home, 'work'), arcname='work')
            return [('worker_%d_dir.tar.gz' % worker, flo.getvalue()),]

        def _reduce(worker1, worker2):
            '''Merge the per-worker (name, tarball) lists into a single list.
            '''
            worker1.extend(worker2)
            return worker1

        # exist_ok avoids failing when the target directory already exists.
        os.makedirs(executor_logs_path, exist_ok=True)
        logs = spark.sparkContext.parallelize(range(num_workers), num_workers).map(_map).reduce(_reduce)
        with tarfile.open(os.path.join(executor_logs_path, 'logs.tar'), 'w') as tar:
            for name, data in logs:
                info = tarfile.TarInfo(name=name)
                info.size = len(data)
                tar.addfile(tarinfo=info, fileobj=BytesIO(data))
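For completeness, to inspect the collected logs afterwards, the outer logs.tar and the per-worker tar.gz archives inside it can be unpacked with something like the following (the paths reuse the placeholders from the snippet above):

import os
import tarfile

executor_logs_path = '/where/to/store/executor_logs'

# Extract the per-worker tar.gz archives from the outer logs.tar ...
with tarfile.open(os.path.join(executor_logs_path, 'logs.tar')) as outer:
    outer.extractall(executor_logs_path)

# ... then unpack each worker's archive into its own sub-directory.
for name in os.listdir(executor_logs_path):
    if name.endswith('.tar.gz'):
        with tarfile.open(os.path.join(executor_logs_path, name)) as inner:
            inner.extractall(
                os.path.join(executor_logs_path, name[:-len('.tar.gz')]))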

A couple of concerns though:

  • not sure if using the map-reduce technique is the best way to collect the logs (parallelize(range(num_workers), num_workers) does not strictly guarantee that exactly one task lands on each worker)
  • the files (tarballs) are built in memory, so depending on your application it can crash if they are too big (see the first sketch after this list)
  • perhaps there is a better way to determine the number of workers than counting the lines of conf/slaves (see the second sketch after this list)
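On the in-memory tarballs: one workaround is to have each task write its archive straight to a filesystem that is mounted on every worker (NFS, or a FUSE-mounted object store) instead of shipping the bytes back through the driver. This is only a sketch under that assumption; the /shared/executor_logs path and the _dump_worker_logs helper are hypothetical, and spark / num_workers are reused from the snippet above:

import os
import socket
import tarfile

# Assumption: this directory is a shared mount visible on every worker.
SHARED_LOGS_DIR = '/shared/executor_logs'


def _dump_worker_logs(_):
    '''Write this worker's <SPARK_HOME>/work as a tar.gz onto the shared mount
    and return the path, so the driver only ever sees short strings.
    '''
    # Assumption: SPARK_HOME is also set in the executor environment.
    spark_home = os.environ['SPARK_HOME']
    os.makedirs(SHARED_LOGS_DIR, exist_ok=True)
    out_path = os.path.join(
        SHARED_LOGS_DIR, '%s_work.tar.gz' % socket.gethostname())
    with tarfile.open(out_path, 'w:gz') as tar:
        tar.add(os.path.join(spark_home, 'work'), arcname='work')
    return out_path


archive_paths = spark.sparkContext.parallelize(
    range(num_workers), num_workers).map(_dump_worker_logs).collect()

As for the number of workers, a trick that is often suggested is to ask the JVM SparkContext for its executor memory status and subtract one for the driver. It goes through py4j internals (_jsc), so treat it as unsupported and liable to change between versions:

# Internal/unsupported: the returned map is keyed by "host:port" and
# includes the driver, hence the "- 1".
num_workers = spark.sparkContext._jsc.sc().getExecutorMemoryStatus().size() - 1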

