Logging attached cluster information in Databricks / Spark

Question

I would like to do some performance testing on Databricks. To do this, I want to log which cluster (VM type, e.g. Standard_DS3_v2) I was using during the test (we can assume that the driver and worker nodes are the same). I know I could log the number of workers, the number of cores (on the driver at least) and the memory (on the driver at least). However, I would like to know the VM type, since I want to be able to tell whether I used, for example, a storage-optimized or a general-purpose cluster. Knowing that category instead of the exact VM type would also be fine. Ideally, I can get this information as a string in a variable within the notebook, so that I can later write it to my log file together with the other information I am logging. However, I am also happy with any hacky workaround if there is no straightforward solution.

Answer 1

You can get this information from the REST API with a GET request to the Clusters API. The notebook context identifies the cluster the notebook is running on: the dbutils.notebook.getContext call returns a set of attributes, including the cluster ID and the workspace domain name, and you can also extract the authentication token from it. Here is code that prints the driver and worker node types (it's in Python, but the Scala code should look similar; I often use Scala's dbutils.notebook.getContext.tags to find which tags are available):

import requests

# The notebook context exposes the workspace host, an API token and the cluster ID.
ctx = dbutils.notebook.entry_point.getDbutils().notebook().getContext()
host_name = ctx.tags().get("browserHostName").get()
host_token = ctx.apiToken().get()
cluster_id = ctx.tags().get("clusterId").get()

# Ask the Clusters API to describe the cluster this notebook is attached to.
response = requests.get(
    f"https://{host_name}/api/2.0/clusters/get",
    params={"cluster_id": cluster_id},
    headers={"Authorization": f"Bearer {host_token}"},
).json()
print(f"driver type={response['driver_node_type_id']} worker type={response['node_type_id']}")
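Since the question asks for the information as a string in a variable, here is a minimal sketch of how the parsed JSON from the clusters/get call might be turned into one log line. The helper name describe_cluster and the sample payload are hypothetical; the sketch assumes the response carries the node_type_id, driver_node_type_id, num_workers and autoscale fields, so check your own response before relying on them.

```python
from typing import Any, Mapping


def describe_cluster(info: Mapping[str, Any]) -> str:
    """Build a one-line summary from a parsed clusters/get response (hypothetical helper)."""
    worker_type = info.get("node_type_id", "unknown")
    # Fall back to the worker type when no separate driver type is reported.
    driver_type = info.get("driver_node_type_id", worker_type)
    if "autoscale" in info:
        # Autoscaling clusters report a worker range rather than a fixed count.
        scale = info["autoscale"]
        workers = f"{scale.get('min_workers')}-{scale.get('max_workers')} (autoscale)"
    else:
        workers = str(info.get("num_workers", "unknown"))
    return f"driver_type={driver_type} worker_type={worker_type} workers={workers}"


# Made-up sample payload; in the notebook, pass the parsed clusters/get JSON instead.
sample = {
    "driver_node_type_id": "Standard_DS3_v2",
    "node_type_id": "Standard_DS3_v2",
    "num_workers": 2,
}
cluster_summary = describe_cluster(sample)
print(cluster_summary)
```

The resulting cluster_summary string can then be written to the log file alongside the other values being recorded.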

Answer 1 (accepted), 2021-02-14 14:46:14
