
How to copy EMR streaming job logs to S3 and clean logs on EMR core node's disk

Good day,

I am running a Flink (v1.7.1) streaming job on AWS EMR 5.20, and I would like to have the logs of all TaskManagers and of the JobManager in S3. Logback is used, as recommended by the Flink team. As it is a long-running job, I want the logs to be:

  1. Copied to S3 periodically
  2. Rolling either on time or size or both (as there might be a huge amount of logs)
  3. Get cleaned from the internal disk of the EMR nodes (otherwise the disks will become full)

What I have tried:

  1. Enabled logging to S3 when creating the EMR cluster
  2. Configured yarn rolling logs with: yarn.log-aggregation-enable, yarn.nodemanager.remote-app-log-dir, yarn.log-aggregation.retain-seconds, yarn.nodemanager.log-aggregation.roll-monitoring-interval-seconds
  3. Configured rolling logs in logback.xml:
<appender name="ROLLING" class="ch.qos.logback.core.rolling.RollingFileAppender">
    <file>${log.file}</file>
    <rollingPolicy class="ch.qos.logback.core.rolling.SizeAndTimeBasedRollingPolicy">
        <fileNamePattern>%d{yyyy-MM-dd HH}.%i.log</fileNamePattern>
        <maxFileSize>30MB</maxFileSize>
        <maxHistory>3</maxHistory>
        <totalSizeCap>50MB</totalSizeCap>
    </rollingPolicy>
    <encoder>
        <pattern>%d{yyyy-MM-dd HH:mm:ss.SSS} [%thread] %-5level %logger{60} %X{sourceThread} - %msg%n</pattern>
    </encoder>
</appender>
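For reference, the YARN settings from (2) can be supplied at cluster creation time as an EMR configuration classification. This is only a sketch; the bucket name and values are illustrative assumptions, and YARN may enforce a minimum of 3600 seconds for the roll-monitoring interval:

```json
[
  {
    "Classification": "yarn-site",
    "Properties": {
      "yarn.log-aggregation-enable": "true",
      "yarn.nodemanager.remote-app-log-dir": "s3://my-flink-logs/yarn-logs",
      "yarn.log-aggregation.retain-seconds": "86400",
      "yarn.nodemanager.log-aggregation.roll-monitoring-interval-seconds": "3600"
    }
  }
]
```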

What I have observed so far:

  • (1) did help with periodically copying the log files to S3
  • (2) has seemed useless so far. Logs are only aggregated when the streaming job ends, and no rolling was observed.
  • (3) yielded some results, but does not meet the requirements yet:
    • the rolling logs are there in some cache folder ( /mnt/yarn/usercache/hadoop/appcache/application_1549236419773_0002/container_1549236419773_0002_01_000002 )
    • only the last rolling logs file is available in the usual YARN logs folder ( /mnt/var/log/hadoop-yarn/containers/application_1549236419773_0002/container_1549236419773_0002_01_000002 )
    • only the last rolling logs file is available in S3

In short, out of my 3 requirements, I could only achieve either (1) or (2 & 3), but not all three.

Could you please help me with this?

Thanks and best regards,

Averell

From what I know, the automatic backup of logs to S3 that EMR supports only works at the end of the job, since it is based on the background log pusher that AWS originally implemented for batch jobs. There may be a way to make it work for rolling logs, but I have never heard of one.

I haven't tried this myself, but if I had to then I'd probably try the following:

  1. Mount S3 on your EC2 instances via s3fs.
  2. Set up logrotate (or equivalent) to automatically copy and clean up the log files.

You can use a bootstrap action to automatically set up all of the above.

If s3fs gives you problems, you can do a bit more scripting and use the aws s3 command directly to sync the logs, removing them once they have been copied.
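The logrotate + aws s3 sync approach could be sketched as a bootstrap action along these lines. This is untested; the bucket name, sync schedule, and staging directory are assumptions, and a real bootstrap action would install the generated files under /etc with sudo:

```shell
#!/bin/bash
# Sketch of an EMR bootstrap action: rotate YARN container logs locally and
# sync them to S3 before they are pruned. Bucket and schedule are assumptions.
set -euo pipefail

BUCKET="${BUCKET:-s3://my-flink-logs}"          # hypothetical bucket
LOG_DIR="/mnt/var/log/hadoop-yarn/containers"
# Configs are staged locally here; a real bootstrap action would copy them
# into /etc/logrotate.d and /etc/cron.d with sudo.
CONF_DIR="${CONF_DIR:-./etc-staging}"

mkdir -p "$CONF_DIR/logrotate.d" "$CONF_DIR/cron.d"

# logrotate: roll each container log at 30 MB, keep 3 rotations.
# copytruncate keeps the open file handle valid for the running JVM.
cat > "$CONF_DIR/logrotate.d/yarn-containers" <<EOF
$LOG_DIR/*/*/*.log {
    size 30M
    rotate 3
    missingok
    copytruncate
    compress
}
EOF

# cron: push logs to S3 every 5 minutes, before logrotate prunes them.
cat > "$CONF_DIR/cron.d/sync-yarn-logs" <<EOF
*/5 * * * * root aws s3 sync $LOG_DIR $BUCKET/containers/
EOF
```

copytruncate is used instead of the default move-and-recreate so the still-running TaskManager JVMs keep writing to a valid file descriptor; the trade-off is that a few log lines written during the copy can be lost.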
