
Spark Structured Streaming - stderr getting filled up


I have a Spark Structured Streaming job on GCP Dataproc which picks up data from Kafka, processes it, and pushes the data back into Kafka topics.

A couple of questions:

  1. Does Spark put all logs (including INFO, WARN, etc.) into stderr? I notice that stdout is empty, while all the logging goes into stderr.

  2. Is there a way for me to expire the data in stderr (i.e. expire the older logs)? Since this is a long-running streaming job, stderr fills up over time and the nodes/VMs become unavailable.

Please advise.

Here is the output of the yarn logs command:

root@versa-structured-stream-v1-w-1:/home/karanalang# yarn logs -applicationId application_1663623368960_0008 -log_files stderr -size -500
2022-09-19 23:25:34,876 INFO client.RMProxy: Connecting to ResourceManager at versa-structured-stream-v1-m/10.142.0.62:8032
2022-09-19 23:25:35,144 INFO client.AHSProxy: Connecting to Application History server at versa-structured-stream-v1-m/10.142.0.62:10200
Can not find any log file matching the pattern: [stderr] for the container: container_e01_1663623368960_0008_01_000003 within the application: application_1663623368960_0008
Container: container_e01_1663623368960_0008_01_000002 on versa-structured-stream-v1-w-2.c.versa-sml-googl.internal:8026
LogAggregationType: LOCAL
=======================================================================================================================
LogType:stderr
LogLastModifiedTime:Mon Sep 19 23:25:35 +0000 2022
LogLength:43251469683
LogContents:
 applianceName=usa-isn0784-rt01, tenantName=NOV, mstatsTimeBlock=1663507200, tenantId=2, vsnId=0, mstatsTotSentOctets=11596, mstatsTotRecvdOctets=24481, mstatsTotSessDuration=300000, mstatsTotSessCount=1, mstatsType=sdwan-acc-ckt-app-stats, appId=https, site=usa-isn0784-rt01, accCkt=WAN-DIA, siteId=442, accCktId=1, user=10.126.117.196, risk=3, productivity=3, family=general-internet, subFamily=web, bzTag=Unknown,topic=syslog.ueba-us4.v1.versa.demo3,customer=versa  type(row) is ->  <class 'str'>
End of LogType:stderr.This log file belongs to a running container (container_e01_1663623368960_0008_01_000002) and so may not be complete.
***********************************************************************


Container: container_e01_1663623368960_0008_01_000001 on versa-structured-stream-v1-w-1.c.versa-sml-googl.internal:8026
LogAggregationType: LOCAL
=======================================================================================================================
LogType:stderr
LogLastModifiedTime:Mon Sep 19 22:54:55 +0000 2022
LogLength:17367929
LogContents:
on syslog.ueba-us4.v1.versa.demo3-2
22/09/19 22:52:52 INFO org.apache.kafka.clients.consumer.internals.SubscriptionState: [Consumer clientId=consumer-spark-kafka-source-0f984ad9-f663-4ce1-9ef1-349419f3e6ec-1714963016-executor-1, groupId=spark-kafka-source-0f984ad9-f663-4ce1-9ef1-349419f3e6ec-1714963016-executor] Resetting offset for partition syslog.ueba-us4.v1.versa.demo3-2 to offset 449568676.
22/09/19 22:54:55 ERROR org.apache.spark.executor.CoarseGrainedExecutorBackend: RECEIVED SIGNAL TERM
End of LogType:stderr.
***********************************************************************


root@versa-structured-stream-v1-w-1:/home/karanalang# yarn logs -applicationId application_1663623368960_0008 -log_files stderr -size -500
2022-09-19 23:26:01,439 INFO client.RMProxy: Connecting to ResourceManager at versa-structured-stream-v1-m/10.142.0.62:8032
2022-09-19 23:26:01,696 INFO client.AHSProxy: Connecting to Application History server at versa-structured-stream-v1-m/10.142.0.62:10200
Can not find any log file matching the pattern: [stderr] for the container: container_e01_1663623368960_0008_01_000003 within the application: application_1663623368960_0008
Container: container_e01_1663623368960_0008_01_000002 on versa-structured-stream-v1-w-2.c.versa-sml-googl.internal:8026
LogAggregationType: LOCAL
=======================================================================================================================
LogType:stderr
LogLastModifiedTime:Mon Sep 19 23:26:02 +0000 2022
LogLength:44309782124
LogContents:
, tenantId=3, vsnId=0, mstatsTotSentOctets=48210, mstatsTotRecvdOctets=242351, mstatsTotSessDuration=300000, mstatsTotSessCount=34, mstatsType=dest-stats, destIp=165.225.216.24, mstatsAttribs=,topic=syslog.ueba-us4.v1.versa.demo3,customer=versa  type(row) is ->  <class 'str'>
22/09/19 23:26:02 WARN org.apache.spark.sql.kafka010.consumer.KafkaDataConsumer: KafkaDataConsumer is not running in UninterruptibleThread. It may hang when KafkaDataConsumer's methods are interrupted because of KAFKA-1894
End of LogType:stderr.This log file belongs to a running container (container_e01_1663623368960_0008_01_000002) and so may not be complete.
***********************************************************************


Container: container_e01_1663623368960_0008_01_000001 on versa-structured-stream-v1-w-1.c.versa-sml-googl.internal:8026
LogAggregationType: LOCAL
=======================================================================================================================
LogType:stderr
LogLastModifiedTime:Mon Sep 19 22:54:55 +0000 2022
LogLength:17367929
LogContents:
on syslog.ueba-us4.v1.versa.demo3-2
22/09/19 22:52:52 INFO org.apache.kafka.clients.consumer.internals.SubscriptionState: [Consumer clientId=consumer-spark-kafka-source-0f984ad9-f663-4ce1-9ef1-349419f3e6ec-1714963016-executor-1, groupId=spark-kafka-source-0f984ad9-f663-4ce1-9ef1-349419f3e6ec-1714963016-executor] Resetting offset for partition syslog.ueba-us4.v1.versa.demo3-2 to offset 449568676.
22/09/19 22:54:55 ERROR org.apache.spark.executor.CoarseGrainedExecutorBackend: RECEIVED SIGNAL TERM
End of LogType:stderr.

Short answer:

You can limit the size of your streaming job's logs with a custom log4j config that uses a RollingFileAppender.

Long answer:

Spark on Dataproc's default log4j config is at /etc/spark/conf/log4j.properties. It configures the root logger to stderr at the INFO level. At runtime, however, driver logs (in client mode) are directed by the Dataproc agent to GCS and streamed back to the client, while executor logs (and driver logs in cluster mode) are redirected by YARN to a stderr file in the container's YARN log directory. Consider using /etc/spark/conf/log4j.properties as the template for your custom config.
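For reference, the relevant part of the default config typically looks something like this (illustrative; check the actual file on your cluster, since image versions differ). Note target=System.err, which is also why stdout stays empty (your question 1):

log4j.rootCategory=INFO, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c: %m%n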

In your custom config, you want logs to be written to a RollingFileAppender, e.g.:

log4j.rootLogger=INFO, rolling_file

log4j.appender.rolling_file=org.apache.log4j.RollingFileAppender
log4j.appender.rolling_file.File=${spark.yarn.app.container.log.dir}/my_app.log
log4j.appender.rolling_file.MaxFileSize=100MB
log4j.appender.rolling_file.MaxBackupIndex=10
...
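One thing the ... above has to cover: log4j 1.x file appenders require a layout, otherwise the appender emits nothing. Something like the following should be part of the config (the conversion pattern here mirrors the default console appender and is an assumption):

log4j.appender.rolling_file.layout=org.apache.log4j.PatternLayout
log4j.appender.rolling_file.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c: %m%n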

Note that for executors (and for the driver in cluster mode), the value of log4j.appender.rolling_file.File needs to be a path under ${spark.yarn.app.container.log.dir}; see this question and this doc.

Upload your log4j config(s) to a GCS bucket. The driver and executor configs may or may not be the same; in your case, you probably only want to update the executor log4j config, and just use the default for the driver.
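For example (the bucket and file names are placeholders, matching the submit command below):

gsutil cp my-log4j.properties gs://my-bucket/my-log4j.properties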

Then submit the job with the custom log4j config:

gcloud dataproc jobs submit spark ... \
  --files gs://my-bucket/my-log4j.properties \
  --properties 'spark.executor.extraJavaOptions=-Dlog4j.configuration=file:my-log4j.properties'
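If you later also want the driver's logs rolled, e.g. when submitting in cluster mode (where the driver log goes through YARN as well), the same file can in principle be wired up through the driver's JVM options. A sketch, assuming the same bucket and file, and that the file is localized into the driver's working directory:

gcloud dataproc jobs submit spark ... \
  --files gs://my-bucket/my-log4j.properties \
  --properties 'spark.executor.extraJavaOptions=-Dlog4j.configuration=file:my-log4j.properties,spark.driver.extraJavaOptions=-Dlog4j.configuration=file:my-log4j.properties'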

Expect rolling logs for Spark executors under the YARN container log directories; they will be automatically aggregated and stored in GCS and Cloud Logging.
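To confirm the rotation is working, you can reuse the yarn logs command from the question, pointing it at the new file name from the config (my_app.log in this sketch):

yarn logs -applicationId <application_id> -log_files my_app.log -size -500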

