
Databricks init script to install Datadog not working

I'm trying to follow this guide to install and configure the Datadog agent in a cluster-scoped init script. The script doesn't actually start the child script it creates to install and configure Datadog.

Here's the script generated by the notebook imported from the above blog post:

#!/bin/bash

echo "Running on the driver? $DB_IS_DRIVER"
echo "Driver ip: $DB_DRIVER_IP"

cat <<EOF >> /tmp/start_datadog.sh
#!/bin/bash

if [ \$DB_IS_DRIVER = "TRUE" ]; then
  echo "On the driver. Installing Datadog ..."
  
  # install the Datadog agent
  DD_API_KEY=<MY_API_KEY> bash -c "\$(curl -L https://raw.githubusercontent.com/DataDog/datadog-agent/master/cmd/agent/install_script.sh)"
  
  # WAITING UNTIL MASTER PARAMS ARE LOADED, THEN GRABBING IP AND PORT
  while [ -z \$gotparams ]; do
    if [ -e "/tmp/driver-env.sh" ]; then
      DB_DRIVER_PORT=\$(grep -i "CONF_UI_PORT" /tmp/driver-env.sh | cut -d'=' -f2)
      gotparams=TRUE
    fi
    sleep 2
  done

  current=\$(hostname -I | xargs)  
  
  # WRITING SPARK CONFIG FILE FOR STREAMING SPARK METRICS
  echo "init_config:
instances:
    - resourcemanager_uri: http://\$DB_DRIVER_IP:\$DB_DRIVER_PORT
      spark_cluster_mode: spark_driver_mode
      cluster_name: \$current" > /etc/datadog-agent/conf.d/spark.yaml

  # RESTARTING AGENT
  sudo service datadog-agent restart

fi
EOF

# CLEANING UP
if [ \$DB_IS_DRIVER = "TRUE" ]; then
  chmod a+x /tmp/start_datadog.sh
  /tmp/start_datadog.sh >> /tmp/datadog_start.log 2>&1 & disown
fi

The event log for the cluster in the Databricks console says it finished running the init scripts, but when I launch Web Terminal in the Databricks console, I see the child script is not running at all:

root@1122-180908-gh1ahrr7-10-4-32-233:/databricks/driver# sh -x /dbfs/init-scripts/datadog-install-driver-only-v2.sh
+ echo Running on the driver? 
Running on the driver? 
+ echo Driver ip: 
Driver ip: 
+ cat
+ [ $DB_IS_DRIVER = TRUE ]
root@1122-180908-gh1ahrr7-10-4-32-233:/databricks/driver# 

root@1122-180908-gh1ahrr7-10-4-39-90:/databricks/driver# ifconfig 
eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 9001
      inet 10.4.39.90  netmask 255.255.0.0  broadcast 10.4.255.255
...

root@1122-180908-gh1ahrr7-10-4-39-90:/databricks/driver# cat /dbfs/cluster-logs/1122-180908-gh1ahrr7/init_scripts/1122-180908-gh1ahrr7_10_4_39_90/20230109_212735_00_datadog-install-driver-only-v2.sh.stdout.log
Running on the driver? TRUE
Driver ip: 10.4.39.90
root@1122-180908-gh1ahrr7-10-4-39-90:/databricks/driver# 

I'm also looking at the Advanced cluster config in the Databricks console and see that the Driver hostname mentioned under the SSH config is different: 10.4.58.10.

Why is the SSH IP different from the IP the Web Terminal logs me into? Is this the reason the cluster init script isn't working? And what is the solution?

You can add this just before the check condition, and it should work:

cat <<EOF > /tmp/start_datadog.sh
#!/bin/bash

if [ -f /databricks/driver/conf/spark-branch.conf ]; then
    export DB_IS_DRIVER=TRUE
    echo \$DB_IS_DRIVER
fi

if [[ \${DB_IS_DRIVER} = "TRUE" ]]; then
  
  echo "Installing Datadog Agent on the driver..."

However, I ran into other issues where I had to manually set the cluster name, and now the Spark URL is also not getting formed correctly. I'm looking into troubleshooting that.

Slightly off-topic perhaps, but here it is: please accept the sum total of wisdom gleaned from 2 months of attacking this stuff.

Take a look at this script:

https://gist.github.com/abij/95b47edf8b6d176fba9ec796da96b715

This kind person has coalesced the standalone and single-node cluster logic into one script, which should help with the driver vs. driver+worker case.

For those having hostname issues with the agent from version 7.40.x, you can add this to your Spark env config (don't add quotes; the agent install script won't strip them out when building the download URL):

DD_AGENT_MINOR_VERSION=39.0
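
Alternatively, if you'd rather pin the version in the init script itself, here is a minimal sketch using the install line from the script above, assuming the install script reads DD_AGENT_MINOR_VERSION from the environment the same way it reads DD_API_KEY (the \$ escape stays because the line sits inside the heredoc that writes /tmp/start_datadog.sh, and <MY_API_KEY> remains a placeholder):

  # Pin the agent to the 7.39.x line to avoid the 7.40+ hostname issue (value unquoted)
  DD_AGENT_MINOR_VERSION=39.0 DD_API_KEY=<MY_API_KEY> bash -c "\$(curl -L https://raw.githubusercontent.com/DataDog/datadog-agent/master/cmd/agent/install_script.sh)"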

The Datadog integration page seems to also have been recently updated to address this hostname issue:

# CONFIGURE HOSTNAME EXPLICITLY IN datadog.yaml TO PREVENT AGENT FROM FAILING ON VERSION 7.40+
  # SEE https://github.com/DataDog/datadog-agent/issues/14152 FOR CHANGE
  hostname=\$(hostname | xargs)
  echo "hostname: \$hostname" >> /etc/datadog-agent/datadog.yaml

Also, be prepared to change the process_config expvar_port from the default of 6062 to something like 6163, as the Spark Java procs will attempt to bind() to ports in the 606x range, which means the Spark driver won't even come up.
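
If you do go the route of editing the agent config from the init script, a minimal sketch of that override, assuming the agent's default config path of /etc/datadog-agent/datadog.yaml and that process_config is not already set elsewhere in the file:

# Move process-agent's expvar endpoint out of the 606x range that Spark binds to
cat <<'DDYAML' >> /etc/datadog-agent/datadog.yaml
process_config:
  expvar_port: 6163
DDYAML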

The datadog-agent config offers DD_PROCESS_CONFIG_EXPVAR_PORT as a way to override this (without having to sed the config from the init script), but, from spelunking through the datadog-agent source, it appears the env var is not bound/considered by the agent and is therefore inert. There is an issue open for this.

Save yourself the headache and use the other way around this, via the Spark config:

By default, ipywidgets occupies port 6062. With Databricks Runtime 11.2 and above, if you run into conflicts with third-party integrations such as Datadog, you can change the port using the following Spark config:

spark.databricks.driver.ipykernel.commChannelPort <port-number>

For example:

spark.databricks.driver.ipykernel.commChannelPort 1234

The Spark config must be set when the cluster is created.

Hoping this can save some time and energy for others.
