
How to automatically copy YARN logs to blob storage using Scala

We have a requirement to download the YARN logs to blob storage automatically. I found that the YARN logs do get added to the storage account under paths such as /app-logs/user/logs/, but they are in a binary format and there is no documented way to convert them into text format. So we are trying to run the external command yarn logs -applicationId <application_id> using Scala at the end of our application run to capture the logs and save them to blob storage, but we are facing issues with that. Looking for a solution to get these logs automatically downloaded to the storage account as part of the Spark pipeline itself.

I tried redirecting the output of the yarn logs command to a temp file and then copying the file from local storage to blob storage. These commands work fine when I SSH into the head node of the Spark cluster and run them, but they do not work when executed from a Jupyter notebook or from the Scala application.

("yarn logs -applicationId application_1561088998595_xxx >  /tmp/yarnlog_2.txt") !!

("hadoop dfs -fs wasbs://dev52mss@sahdimssperfdev.blob.core.windows.net -copyFromLocal /tmp/yarnlog_2.txt /tmp/") !!

When I run these commands from a Jupyter notebook, the first command redirects to a local file fine, but the second one, which moves the file to blob storage, fails with the following error:

warning: there was one feature warning; re-run with -feature for details
java.lang.RuntimeException: Nonzero exit value: 1
  at scala.sys.package$.error(package.scala:27)
  at scala.sys.process.ProcessBuilderImpl$AbstractBuilder.slurp(ProcessBuilderImpl.scala:132)
  at scala.sys.process.ProcessBuilderImpl$AbstractBuilder.$bang$bang(ProcessBuilderImpl.scala:102)
  ... 56 elided
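One thing worth noting: scala.sys.process does not pass the command string through a shell, so the > redirection and the notebook's environment are not handled the way they are in an interactive SSH session. Below is a minimal sketch that routes both steps through bash -c and captures stderr for diagnosis; the command strings are the ones above, but the approach itself is an assumption about the failure, not a confirmed fix.

import scala.sys.process._

// Run each step through bash -c so the '>' redirection is interpreted by a
// shell, exactly as it is in an interactive SSH session.
val logCmd  = Seq("bash", "-c",
  "yarn logs -applicationId application_1561088998595_xxx > /tmp/yarnlog_2.txt")
val copyCmd = Seq("bash", "-c",
  "hadoop dfs -fs wasbs://dev52mss@sahdimssperfdev.blob.core.windows.net " +
  "-copyFromLocal /tmp/yarnlog_2.txt /tmp/")

// Capture stderr instead of letting '!!' throw on a nonzero exit value.
val err    = new StringBuilder
val logger = ProcessLogger(_ => (), line => err.append(line).append('\n'))
val exit = logCmd.!(logger) match {
  case 0    => copyCmd.!(logger)
  case code => code
}
if (exit != 0) println(s"failed with exit code $exit:\n$err")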

Initially, I tried capturing the output of the command as a DataFrame and writing the DataFrame to blob storage. This succeeded for small logs, but for huge logs it failed with the error:

Serialized task 15:0 was 137500581 bytes, which exceeds max allowed: spark.rpc.message.maxSize (134217728 bytes). Consider increasing spark.rpc.message.maxSize or using broadcast variables for large values

import scala.sys.process._
import spark.implicits._

val yarnLog = Seq(Process("yarn logs -applicationId " + "application_1560960859861_0003").!!).toDF()
yarnLog.write.mode("overwrite").text("wasbs://container@storageAccount.blob.core.windows.net/Dev/Logs/application_1560960859861_0003.txt")
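Since the DataFrame route ships the entire log as a single serialized task (hence the spark.rpc.message.maxSize error), one possible alternative, sketched here under the assumption that a notebook-style spark session is available, is to write the captured string straight from the driver with the Hadoop FileSystem API:

import java.io.PrintWriter
import org.apache.hadoop.fs.{FileSystem, Path}
import scala.sys.process._

// Capture the aggregated log text on the driver.
val appLog = Seq("bash", "-c",
  "yarn logs -applicationId application_1560960859861_0003").!!

// Write it to blob storage through the Hadoop FileSystem API: no Spark task
// is involved, so spark.rpc.message.maxSize never comes into play.
val dest = new Path("wasbs://container@storageAccount.blob.core.windows.net/Dev/Logs/application_1560960859861_0003.txt")
val fs   = FileSystem.get(dest.toUri, spark.sparkContext.hadoopConfiguration)
val out  = new PrintWriter(fs.create(dest, true))
try out.write(appLog) finally out.close()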

Note: You can directly access the log files using Azure Storage => Blobs => Select Container => app logs


Azure HDInsight stores its log files both in the cluster file system and in Azure Storage. You can examine log files in the cluster by opening an SSH connection to the cluster and browsing the file system, or by using the Hadoop YARN status portal on the remote head node server. You can examine the log files in Azure Storage using any of the tools that can access and download data from Azure Storage.

Examples are AzCopy, CloudXplorer, and the Visual Studio Server Explorer. You can also use PowerShell and the Azure Storage Client libraries, or the Azure .NET SDKs, to access data in Azure blob storage.

For more details, refer to "Manage logs for Azure HDInsight cluster".

Hope this helps.

Currently, you will need to use the 'yarn logs' command to view YARN logs.

As regards your requirement, there are two methods to achieve this:

Method 1:

Schedule a daily copy of the app-logs folder into a desired container within blob storage. This does a differential copy every day at a specific time. For this one, I had to use Azure Data Factory to achieve the scheduling; it is quite easy, and no manual copying or coding is required. However, because the YARN application logs are stored in a binary TFile format and can only be read using the 'yarn logs' command, you will need another tool to read the files from the destination later on. You can use the tool at https://github.com/shanyu/hadooplogparser to read the files.

Alternatively, you can have your own simple script that converts the logs to a readable text file before the transfer. Sample script below:

yarn logs -applicationId application_15645293xxxxx > /tmp/source/applog_back.txt

hadoop dfs -fs wasbs://hdiblob@sandboxblob.blob.core.windows.net -copyFromLocal /tmp/source/applog_back.txt /tmp/destination

Method 2:

This is the simplest and cheapest method. You can disable the retention period of the YARN application logs, which means the logs will be retained indefinitely. To do this, change the config "yarn.log-aggregation.retain-seconds" to the value -1. This config can be found in yarn-site.xml.
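For illustration, here is how that property would look in yarn-site.xml (a minimal sketch; on HDInsight such changes are typically made through Ambari so that they persist, and a value of -1 disables log deletion):

<property>
  <!-- -1 means aggregated YARN application logs are never deleted -->
  <name>yarn.log-aggregation.retain-seconds</name>
  <value>-1</value>
</property>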

Once this is done, you can read your YARN application logs from the cluster at any time using the YARN UI or CLI.

Hope this helps.
