
Databricks connect & PyCharm & remote SSH connection

Hey StackOverflowers!

I have run into a problem.

I have set up PyCharm to connect to an (Azure) VM over an SSH connection.

  1. First I created the configuration for the SSH connection. (screenshot omitted)

  2. I set up the mappings.

  3. I created a conda environment by opening a terminal in the VM, then I installed and configured databricks-connect. I tested it in the terminal and it works fine.

  4. I set up the console in the PyCharm run configuration. (screenshot omitted)

But when I try to start a Spark session (spark = SparkSession.builder.getOrCreate()), databricks-connect looks for the .databricks-connect file in the wrong folder and gives me the following error:

Caused by: java.lang.RuntimeException: Config file /root/.databricks-connect not found. Please run `databricks-connect configure` to accept the end user license agreement and configure Databricks Connect. A copy of the EULA is provided below: Copyright (2018) Databricks, Inc.

And the full error plus some warnings:

20/07/10 17:23:05 WARN Utils: Your hostname, george resolves to a loopback address: 127.0.0.1; using 10.0.0.4 instead (on interface eth0)
20/07/10 17:23:05 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
20/07/10 17:23:05 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).

Traceback (most recent call last):
  File "/anaconda/envs/py37/lib/python3.7/site-packages/IPython/core/interactiveshell.py", line 3331, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-2-23fe18298795>", line 1, in <module>
    runfile('/home/azureuser/code/model/check_vm.py')
  File "/home/azureuser/.pycharm_helpers/pydev/_pydev_bundle/pydev_umd.py", line 197, in runfile
    pydev_imports.execfile(filename, global_vars, local_vars)  # execute the script
  File "/home/azureuser/.pycharm_helpers/pydev/_pydev_imps/_pydev_execfile.py", line 18, in execfile
    exec(compile(contents+"\n", file, 'exec'), glob, loc)
  File "/home/azureuser/code/model/check_vm.py", line 13, in <module>
    spark = SparkSession.builder.getOrCreate()
  File "/anaconda/envs/py37/lib/python3.7/site-packages/pyspark/sql/session.py", line 185, in getOrCreate
    sc = SparkContext.getOrCreate(sparkConf)
  File "/anaconda/envs/py37/lib/python3.7/site-packages/pyspark/context.py", line 373, in getOrCreate
    SparkContext(conf=conf or SparkConf())
  File "/anaconda/envs/py37/lib/python3.7/site-packages/pyspark/context.py", line 137, in __init__
    conf, jsc, profiler_cls)
  File "/anaconda/envs/py37/lib/python3.7/site-packages/pyspark/context.py", line 199, in _do_init
    self._jsc = jsc or self._initialize_context(self._conf._jconf)
  File "/anaconda/envs/py37/lib/python3.7/site-packages/pyspark/context.py", line 312, in _initialize_context
    return self._jvm.JavaSparkContext(jconf)
  File "/anaconda/envs/py37/lib/python3.7/site-packages/py4j/java_gateway.py", line 1525, in __call__
    answer, self._gateway_client, None, self._fqn)
  File "/anaconda/envs/py37/lib/python3.7/site-packages/py4j/protocol.py", line 328, in get_return_value
    format(target_id, ".", name), value)
py4j.protocol.Py4JJavaError: An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext.
: java.lang.ExceptionInInitializerError
    at org.apache.spark.SparkContext.<init>(SparkContext.scala:99)
    at org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:61)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:247)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:380)
    at py4j.Gateway.invoke(Gateway.java:250)
    at py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)
    at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)
    at py4j.GatewayConnection.run(GatewayConnection.java:251)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.RuntimeException: Config file /root/.databricks-connect not found. Please run `databricks-connect configure` to accept the end user license agreement and configure Databricks Connect. A copy of the EULA is provided below: Copyright (2018) Databricks, Inc.
This library (the "Software") may not be used except in connection with the Licensee's use of the Databricks Platform Services pursuant to an Agreement (defined below) between Licensee (defined below) and Databricks, Inc. ("Databricks"). This Software shall be deemed part of the “Subscription Services” under the Agreement, or if the Agreement does not define Subscription Services, then the term in such Agreement that refers to the applicable Databricks Platform Services (as defined below) shall be substituted herein for “Subscription Services.”  Licensee's use of the Software must comply at all times with any restrictions applicable to the Subscription Services, generally, and must be used in accordance with any applicable documentation. If you have not agreed to an Agreement or otherwise do not agree to these terms, you may not use the Software.  This license terminates automatically upon the termination of the Agreement or Licensee's breach of these terms.
Agreement: the agreement between Databricks and Licensee governing the use of the Databricks Platform Services, which shall be, with respect to Databricks, the Databricks Terms of Service located at www.databricks.com/termsofservice, and with respect to Databricks Community Edition, the Community Edition Terms of Service located at www.databricks.com/ce-termsofuse, in each case unless Licensee has entered into a separate written agreement with Databricks governing the use of the applicable Databricks Platform Services. Databricks Platform Services: the Databricks services or the Databricks Community Edition services, according to where the Software is used.
Licensee: the user of the Software, or, if the Software is being used on behalf of a company, the company.
To accept this agreement and start using Databricks Connect, run `databricks-connect configure` in a shell.
    at com.databricks.spark.util.DatabricksConnectConf$.checkEula(DatabricksConnectConf.scala:41)
    at org.apache.spark.SparkContext$.<init>(SparkContext.scala:2679)
    at org.apache.spark.SparkContext$.<clinit>(SparkContext.scala)
    ... 13 more

However, I do not have access rights to that folder, so I cannot put the .databricks-connect file there.

What is also strange: if I go PyCharm -> SSH terminal -> activate the conda env -> python, the code above works fine.

So, is there a way to:

1. Point Java to where the .databricks-connect file is

2. Configure databricks-connect in another way, through the script or environment variables inside PyCharm

3. Do it some other way?

Or am I missing something?
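For option 2, what I had in mind is something like the sketch below (assuming the legacy Databricks Connect client really does read the documented spark.databricks.service.* properties; every value is a placeholder):

from pyspark.sql import SparkSession

# Sketch: configure the legacy Databricks Connect client in code instead of
# relying on the ~/.databricks-connect file. All values are placeholders to
# replace with your own workspace details.
spark = (
    SparkSession.builder
    .config("spark.databricks.service.address", "https://<region>.azuredatabricks.net")
    .config("spark.databricks.service.token", "<personal-access-token>")
    .config("spark.databricks.service.clusterId", "<cluster-id>")
    .config("spark.databricks.service.orgId", "<org-id>")  # Azure only
    .config("spark.databricks.service.port", "15001")
    .getOrCreate()
)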

This seems to be the official tutorial on how to do what you want (i.e., Databricks Connect).

Most likely, your .databricks-connect setup has the wrong versions.

You need to use Java 8 instead of 11, Databricks Runtime 5.5 LTS or Databricks Runtime 6.1-6.6, and your Python version should be the same on both ends.

Here are the steps they give:

conda create --name dbconnect python=3.5
pip uninstall pyspark
pip install -U databricks-connect==5.5.*  # or 6.*.* to match your cluster version. 6.1-6.6 are supported

Then you need the URL, token, cluster ID, organization ID, and port. Finally, run these commands in the terminal:

databricks-connect configure
databricks-connect test
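
For reference, databricks-connect configure stores your answers in a small JSON file at ~/.databricks-connect, which is exactly the file the error above reports missing under /root. A rough sketch of its shape (field names as used by the legacy client; all values are placeholders):

{
  "host": "https://<region>.azuredatabricks.net",
  "token": "<personal-access-token>",
  "cluster_id": "<cluster-id>",
  "org_id": "<org-id>",
  "port": "15001"
}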

There is more to do after that, but this should get it working. Keep in mind that you need to make sure everything you use is compatible. Once you have it all set up, try configuring the IDE (PyCharm) so it works.
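
A quick way to sanity-check those compatibility constraints from the VM terminal (plain shell; only the last two commands are Databricks-specific):

java -version                # should report 1.8.x (Java 8), not 11
python --version             # should match the Python version on the cluster
pip show databricks-connect  # client version should match the cluster runtime
databricks-connect test      # end-to-end smoke test against the cluster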

From the error I can see that, first, you need to accept the Databricks terms and conditions, and second, you should follow these instructions from Databricks for the PyCharm IDE:

  • Command-line interface (CLI)

    databricks-connect configure

    The license displays:

    Copyright (2018) Databricks, Inc.

    This library (the "Software") may not be used except in connection with the Licensee's use of the Databricks Platform Services pursuant to an Agreement...

    Accept the license and supply the configuration values.

     Do you accept the above agreement? [y/N] y

     Set new config values (leave input empty to accept default):
     Databricks Host [no current value, must start with https://]:
     Databricks Token [no current value]:
     Cluster ID (e.g., 0921-001415-jelly628) [no current value]:
     Org ID (Azure-only, see ?o=orgId in URL) [0]:
     Port [15001]:

  • The Databricks Connect configuration script automatically adds the package to your project configuration.

    Python 3 clusters: Go to Run > Edit Configurations.

    Add PYSPARK_PYTHON=python3 as an environment variable.

    (Screenshot: Python 3 cluster configuration)
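
As an aside, instead of the ~/.databricks-connect file (which your JVM is looking for under /root), the legacy Databricks Connect client also documents configuration through environment variables, and those can be added in the same Run > Edit Configurations dialog. A sketch with placeholder values:

PYSPARK_PYTHON=python3
DATABRICKS_ADDRESS=https://<region>.azuredatabricks.net
DATABRICKS_API_TOKEN=<personal-access-token>
DATABRICKS_CLUSTER_ID=<cluster-id>
DATABRICKS_ORG_ID=<org-id>
DATABRICKS_PORT=15001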

Lastly, did you manage to set up a remote PyCharm SSH interpreter with Databricks? I am currently evaluating whether Databricks can do the job for a project I am working on.

As far as I understand, databricks-connect only helps launch Spark jobs on the remote cluster, while the rest of your non-Spark code is executed locally...
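
A small sketch of what that split looks like in practice (stock PySpark only; assumes a working databricks-connect setup): the DataFrame computation below is shipped to the remote cluster, while the ordinary Python around it runs wherever the script starts.

import socket

from pyspark.sql import SparkSession

# With databricks-connect configured, building the session attaches to the
# remote Databricks cluster instead of starting a purely local Spark.
spark = SparkSession.builder.getOrCreate()

# This DataFrame computation executes on the cluster...
total = spark.range(100).groupBy().sum("id").collect()[0][0]

# ...while plain Python statements like these run on the local interpreter.
print("sum(id) computed on the cluster:", total)  # 4950
print("this print ran on:", socket.gethostname())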
