Databricks init script to install Datadog not working
I'm trying to follow this guide to install and configure the Datadog agent in a cluster-scoped init script. The script never starts the child script it creates to actually install and configure Datadog.
Here's the script generated by the notebook imported from the above blog post:
#!/bin/bash
echo "Running on the driver? $DB_IS_DRIVER"
echo "Driver ip: $DB_DRIVER_IP"
cat <<EOF >> /tmp/start_datadog.sh
#!/bin/bash
if [ \$DB_IS_DRIVER = "TRUE" ]; then
echo "On the driver. Installing Datadog ..."
# install the Datadog agent
DD_API_KEY=<MY_API_KEY> bash -c "\$(curl -L https://raw.githubusercontent.com/DataDog/datadog-agent/master/cmd/agent/install_script.sh)"
# WAITING UNTIL MASTER PARAMS ARE LOADED, THEN GRABBING IP AND PORT
while [ -z \$gotparams ]; do
if [ -e "/tmp/driver-env.sh" ]; then
DB_DRIVER_PORT=\$(grep -i "CONF_UI_PORT" /tmp/driver-env.sh | cut -d'=' -f2)
gotparams=TRUE
fi
sleep 2
done
current=\$(hostname -I | xargs)
# WRITING SPARK CONFIG FILE FOR STREAMING SPARK METRICS
echo "init_config:
instances:
- resourcemanager_uri: http://\$DB_DRIVER_IP:\$DB_DRIVER_PORT
spark_cluster_mode: spark_driver_mode
cluster_name: \$current" > /etc/datadog-agent/conf.d/spark.yaml
# RESTARTING AGENT
sudo service datadog-agent restart
fi
EOF
# CLEANING UP
if [ \$DB_IS_DRIVER = "TRUE" ]; then
chmod a+x /tmp/start_datadog.sh
/tmp/start_datadog.sh >> /tmp/datadog_start.log 2>&1 & disown
fi
The event log for the cluster in the Databricks console says it finished running the init scripts, but when I launch a Web Terminal from the Databricks console, I see the child script is not running at all:
root@1122-180908-gh1ahrr7-10-4-32-233:/databricks/driver# sh -x /dbfs/init-scripts/datadog-install-driver-only-v2.sh
+ echo Running on the driver?
Running on the driver?
+ echo Driver ip:
Driver ip:
+ cat
+ [ $DB_IS_DRIVER = TRUE ]
root@1122-180908-gh1ahrr7-10-4-32-233:/databricks/driver#
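A side note on that trace: the DB_* variables are only injected when Databricks itself runs the init script, so in an interactive Web Terminal session DB_IS_DRIVER is unset and the unquoted test expands to `[ = TRUE ]`, which the shell rejects. A minimal standalone sketch of the safer quoted form:

```shell
#!/bin/bash
# With DB_IS_DRIVER unset (as in an interactive Web Terminal session), the
# unquoted test `[ $DB_IS_DRIVER = "TRUE" ]` expands to `[ = TRUE ]`, which
# the shell rejects. Quoting the expansion keeps the test well-formed.
unset DB_IS_DRIVER
if [ "$DB_IS_DRIVER" = "TRUE" ]; then
  echo "on the driver"
else
  echo "not the driver (or variable not set)"
fi
```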
root@1122-180908-gh1ahrr7-10-4-39-90:/databricks/driver# ifconfig
eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 9001
inet 10.4.39.90 netmask 255.255.0.0 broadcast 10.4.255.255
...
root@1122-180908-gh1ahrr7-10-4-39-90:/databricks/driver# cat /dbfs/cluster-logs/1122-180908-gh1ahrr7/init_scripts/1122-180908-gh1ahrr7_10_4_39_90/20230109_212735_00_datadog-install-driver-only-v2.sh.stdout.log
Running on the driver? TRUE
Driver ip: 10.4.39.90
root@1122-180908-gh1ahrr7-10-4-39-90:/databricks/driver#
I'm also looking at the Advanced cluster config in the Databricks console and see that the Driver hostname mentioned under the SSH config is different: 10.4.58.10.
Why is the SSH IP different from the IP the Web Terminal logs me into? Is this the reason the cluster init script isn't working? And what is the solution?
You can add this just before the check condition and it should work:
cat <<EOF > /tmp/start_datadog.sh
#!/bin/bash
if [ -f /databricks/driver/conf/spark-branch.conf ]; then
export DB_IS_DRIVER=TRUE
echo \$DB_IS_DRIVER
fi
if [[ \${DB_IS_DRIVER} = "TRUE" ]]; then
echo "Installing Datadog Agent on the driver..."
However, I ran into other issues: I had to manually set the cluster name, and now the Spark URL is also not getting formed correctly. Still troubleshooting that.
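Putting that together, a minimal sketch of the driver-detection fallback as a reusable function (the spark-branch.conf path is the one from the snippet above; the is_driver helper name is mine, for illustration):

```shell
#!/bin/bash
# Fallback driver detection for the generated child script: the DB_* env vars
# are not always visible when it runs, so probe for a file that Databricks
# lays down only on driver nodes.
is_driver() {
  local conf="${1:-/databricks/driver/conf/spark-branch.conf}"
  if [ -f "$conf" ]; then
    export DB_IS_DRIVER=TRUE
  fi
  [ "${DB_IS_DRIVER}" = "TRUE" ]
}

if is_driver; then
  echo "Installing Datadog Agent on the driver..."
  # ...agent install and spark.yaml configuration would follow here...
fi
```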
Slightly OT perhaps, but here it is: the sum total of wisdom gleaned from two months of attacking this stuff.
Take a look at this script: https://gist.github.com/abij/95b47edf8b6d176fba9ec796da96b715
This kind person has coalesced the standalone and single-node cluster logic into one script, which should help with the driver vs. driver+worker case.
For those having hostname issues with the agent from version 7.40.x, you can add this to your Spark env config (don't add quotes; the agent install script won't strip them out when building the download URL):
DD_AGENT_MINOR_VERSION=39.0
The Datadog integration page also seems to have been recently updated to address this hostname issue:
# CONFIGURE HOSTNAME EXPLICITLY IN datadog.yaml TO PREVENT AGENT FROM FAILING ON VERSION 7.40+
# SEE https://github.com/DataDog/datadog-agent/issues/14152 FOR CHANGE
hostname=\$(hostname | xargs)
echo "hostname: \$hostname" >> /etc/datadog-agent/datadog.yaml
Also, be prepared to have to change the process_config expvar_port from the default of 6062 to something like 6163, as the Spark java procs will attempt to bind() to ports in the 606x range, which means the Spark driver won't even come up.
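If you do end up editing the agent config from the init script, here is a hedged sketch (the set_expvar_port helper and the parameterized path are mine; 6163 is just the example port from above):

```shell
#!/bin/bash
# Sketch: move the process-agent's expvar server off the 606x range that the
# Spark JVMs bind. Appends a process_config override to datadog.yaml; the
# path is a parameter so this can be exercised outside a real agent host.
set_expvar_port() {
  local conf="$1" port="${2:-6163}"
  cat >> "$conf" <<YAML
process_config:
  expvar_port: ${port}
YAML
}

# On a Databricks driver this would be:
#   set_expvar_port /etc/datadog-agent/datadog.yaml 6163
#   sudo service datadog-agent restart
```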
The datadog-agent config offers DD_PROCESS_CONFIG_EXPVAR_PORT as a way to override this (without having to sed the config from the init script), but, from spelunking through the datadog-agent source, it appears the envvar is not bound/considered by the agent and is therefore inert. There is an issue open for this.
Save yourself the headaches and work around this the other way, via the Spark config:
By default, ipywidgets occupies port 6062. With Databricks Runtime 11.2 and above, if you run into conflicts with third-party integrations such as Datadog, you can change the port using the following Spark config:
spark.databricks.driver.ipykernel.commChannelPort <port-number>
For example:
spark.databricks.driver.ipykernel.commChannelPort 1234
The Spark config must be set when the cluster is created.
Hoping this can save some time and energy for others.