PySpark JDBC read gives an error. How do I fix it?

Question

I am connecting to RDS MySQL using JDBC in PySpark. I have tried almost everything I found on Stack Overflow for debugging, but I still cannot make it work.

from pyspark.sql import SparkSession

# mysql_jar holds the path to the MySQL Connector/J jar
spark = SparkSession.builder.config("spark.jars", mysql_jar) \
            .master("local[*]").appName("PySpark_MySQL_test").getOrCreate()
# Read one table over JDBC; the credentials are passed in the URL
df = spark.read.format("jdbc").option("url", "jdbc:mysql://hostname.amazonaws.com:1150/dbname?user=user_name&password=password") \
            .option("driver", "com.mysql.cj.jdbc.Driver").option("dbtable", "table_name").load()

I have tried the same connection details with Python's pymysql library, and it connects and returns results.
But here I get the error below and am unable to solve it.

raise Py4JJavaError(
py4j.protocol.Py4JJavaError: An error occurred while calling o38.load.
: com.mysql.cj.jdbc.exceptions.CommunicationsException: Communications link failure

The last packet sent successfully to the server was 0 milliseconds ago. The driver has not received any packets from the server.
    at com.mysql.cj.jdbc.exceptions.SQLError.createCommunicationsException(SQLError.java:174)
    at com.mysql.cj.jdbc.exceptions.SQLExceptionsMapping.translateException(SQLExceptionsMapping.java:64)
    at com.mysql.cj.jdbc.ConnectionImpl.createNewIO(ConnectionImpl.java:827)
    at com.mysql.cj.jdbc.ConnectionImpl.<init>(ConnectionImpl.java:447)
    at com.mysql.cj.jdbc.ConnectionImpl.getInstance(ConnectionImpl.java:237)
    at com.mysql.cj.jdbc.NonRegisteringDriver.connect(NonRegisteringDriver.java:199)
    at org.apache.spark.sql.execution.datasources.jdbc.connection.BasicConnectionProvider.getConnection(BasicConnectionProvider.scala:49)
    at org.apache.spark.sql.execution.datasources.jdbc.connection.ConnectionProvider$.create(ConnectionProvider.scala:68)
    at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.$anonfun$createConnectionFactory$1(JdbcUtils.scala:62)
    at org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:56)
    at org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation$.getSchema(JDBCRelation.scala:226)
    at org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:35)
    at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:355)
    at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:325)
    at org.apache.spark.sql.DataFrameReader.$anonfun$load$3(DataFrameReader.scala:307)
    at scala.Option.getOrElse(Option.scala:189)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:307)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:225)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.base/java.lang.reflect.Method.invoke(Method.java:566)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:282)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:238)
    at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: com.mysql.cj.exceptions.CJCommunicationsException: Communications link failure

Answer 1

I ran into the same issue, and it works now. The root cause is that Spark uses the master node to connect to MySQL and the worker nodes to execute tasks, so a connection can succeed from one machine while the job still raises a communication error. Based on this, open up the security rules on the MySQL side so that every Spark node can connect to MySQL.
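
One way to check this theory before changing any security rules is to test plain TCP reachability from each Spark node. The sketch below is not part of the original answer. It uses only Python's standard socket module, and the function name can_reach is made up for illustration. It only tells you whether the port can be reached from that machine, not whether the credentials or the JDBC driver are correct.

```python
import socket


def can_reach(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the host name and opens a TCP socket
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures
        return False


# Hypothetical usage, with the host and port taken from the question:
# print(can_reach("hostname.amazonaws.com", 1150))
```

Running the check on the master and on each worker should show which node, if any, is blocked from reaching MySQL.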

Answer 2

For anyone who lands here while running in Docker, give the configuration below a try:

source_df = spark.read.format('jdbc').options(
        url='jdbc:mysql://host.docker.internal:3306/superset?useSSL=false&allowPublicKeyRetrieval=true',
        driver='com.mysql.cj.jdbc.Driver',
        dbtable='table',
        user='root',
        password='root').load()

I tried localhost, 127.0.0.1, and even the IPAddress reported by docker inspect as the host, but none of them worked. Changing the host to host.docker.internal made it work.
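
When trying several host names like this, it can help to build the JDBC URL from its parts so that only the host changes between attempts. This is a minimal sketch, not part of the original answer. The helper name mysql_jdbc_url is hypothetical, and it assumes the usual jdbc:mysql://host:port/database?key=value layout seen in the configuration above.

```python
from urllib.parse import urlencode


def mysql_jdbc_url(host: str, port: int, database: str, **params: str) -> str:
    """Assemble a MySQL JDBC URL; params become the query string."""
    url = f"jdbc:mysql://{host}:{port}/{database}"
    query = urlencode(params)  # keeps the keyword arguments in the order given
    return f"{url}?{query}" if query else url


# Candidate hosts to try in turn, as described in the answer
for host in ["localhost", "127.0.0.1", "host.docker.internal"]:
    print(mysql_jdbc_url(host, 3306, "superset",
                         useSSL="false", allowPublicKeyRetrieval="true"))
```

Each printed URL can be passed as the url option to spark.read.format('jdbc'), with the other options left unchanged.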
