
How to copy EMR streaming job logs to S3 and clean logs on EMR core node's disk

Good day,

I am running a Flink (v1.7.1) streaming job on AWS EMR 5.20, and I would like to have the logs of all TaskManagers and of the JobManager in S3. Logback is used, as recommended by the Flink team. As it is a long-running job, I want the logs to be:

  1. Copied to S3 periodically
  2. Rolling either on time or size or both (as there might be a huge amount of logs)
  3. Get cleaned from the internal disk of the EMR nodes (otherwise the disks will become full)

What I have tried:

  1. Enabled logging to S3 when creating the EMR cluster
  2. Configured yarn rolling logs with: yarn.log-aggregation-enable, yarn.nodemanager.remote-app-log-dir, yarn.log-aggregation.retain-seconds, yarn.nodemanager.log-aggregation.roll-monitoring-interval-seconds
  3. Configured rolling logs in logback.xml:
<appender name="ROLLING" class="ch.qos.logback.core.rolling.RollingFileAppender">
    <file>${log.file}</file>
    <rollingPolicy class="ch.qos.logback.core.rolling.SizeAndTimeBasedRollingPolicy">
        <fileNamePattern>%d{yyyy-MM-dd HH}.%i.log</fileNamePattern>
        <maxFileSize>30MB</maxFileSize>
        <maxHistory>3</maxHistory>
        <totalSizeCap>50MB</totalSizeCap>
    </rollingPolicy>
    <encoder>
        <pattern>%d{yyyy-MM-dd HH:mm:ss.SSS} [%thread] %-5level %logger{60} %X{sourceThread} - %msg%n</pattern>
    </encoder>
</appender>
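For reference, the YARN settings from (2) can be supplied at cluster creation time as an EMR configuration classification. This is only a sketch; the bucket name and values are illustrative assumptions, and YARN may enforce a minimum of 3600 seconds for the roll-monitoring interval:

```json
[
  {
    "Classification": "yarn-site",
    "Properties": {
      "yarn.log-aggregation-enable": "true",
      "yarn.nodemanager.remote-app-log-dir": "s3://my-flink-logs/yarn-logs",
      "yarn.log-aggregation.retain-seconds": "86400",
      "yarn.nodemanager.log-aggregation.roll-monitoring-interval-seconds": "3600"
    }
  }
]
```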

What I have observed so far:

  • (1) did help with periodically copying the log files to S3
  • (2) has seemed useless so far. Logs are only aggregated when the streaming job ends, and no rolling was observed.
  • (3) yielded some results, but does not meet the requirements yet:
    • the rolling logs are there in some cache folder ( /mnt/yarn/usercache/hadoop/appcache/application_1549236419773_0002/container_1549236419773_0002_01_000002 )
    • only the last rolling logs file is available in the usual YARN logs folder ( /mnt/var/log/hadoop-yarn/containers/application_1549236419773_0002/container_1549236419773_0002_01_000002 )
    • only the last rolling logs file is available in S3

In short, out of my 3 requirements, I could only achieve either (1) or (2 & 3), but not all three.

Could you please help me with this?

Thanks and best regards,

Averell

From what I know, the automatic backup of logs to S3 that EMR supports only works at the end of the job, since it is based on the background log pusher that AWS originally implemented for batch jobs. There may be a way to make it work for rolling logs, but I have never heard of one.

I haven't tried this myself, but if I had to then I'd probably try the following:

  1. Mount S3 on your EC2 instances via s3fs.
  2. Set up logrotate (or equivalent) to automatically copy and clean up the log files.

You can use a bootstrap action to automatically set up all of the above.

If s3fs gives you problems, you can do a bit more scripting and use the aws s3 command directly to sync the logs, removing them once they have been copied.
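The logrotate + aws s3 sync approach could be sketched as a bootstrap action along these lines. This is untested; the bucket name, sync schedule, and staging directory are assumptions, and a real bootstrap action would install the generated files under /etc with sudo:

```shell
#!/bin/bash
# Sketch of an EMR bootstrap action: rotate YARN container logs locally and
# sync them to S3 before they are pruned. Bucket and schedule are assumptions.
set -euo pipefail

BUCKET="${BUCKET:-s3://my-flink-logs}"          # hypothetical bucket
LOG_DIR="/mnt/var/log/hadoop-yarn/containers"
# Configs are staged locally here; a real bootstrap action would copy them
# into /etc/logrotate.d and /etc/cron.d with sudo.
CONF_DIR="${CONF_DIR:-./etc-staging}"

mkdir -p "$CONF_DIR/logrotate.d" "$CONF_DIR/cron.d"

# logrotate: roll each container log at 30 MB, keep 3 rotations.
# copytruncate keeps the open file handle valid for the running JVM.
cat > "$CONF_DIR/logrotate.d/yarn-containers" <<EOF
$LOG_DIR/*/*/*.log {
    size 30M
    rotate 3
    missingok
    copytruncate
    compress
}
EOF

# cron: push logs to S3 every 5 minutes, before logrotate prunes them.
cat > "$CONF_DIR/cron.d/sync-yarn-logs" <<EOF
*/5 * * * * root aws s3 sync $LOG_DIR $BUCKET/containers/
EOF
```

copytruncate is used instead of the default move-and-recreate so the still-running TaskManager JVMs keep writing to a valid file descriptor; the trade-off is that a few log lines written during the copy can be lost.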
