
Limit number of connections to MySQL database using JDBC driver in Spark

I am currently importing data from a MySQL database into Spark with the JDBC driver, using the following command in pyspark:

dataframe_mysql = sqlctx \
    .read \
    .format("jdbc") \
    .option("url", "jdbc:mysql://<IP-ADDRESS>:3306/<DATABASE>") \
    .option("driver", "com.mysql.jdbc.Driver") \
    .option("dbtable", "<TABLE>") \
    .option("user", "<USER>") \
    .option("password", "<PASSWORD>") \
    .load()

When I run the Spark job, I get the following error message:

com.mysql.jdbc.exceptions.jdbc4.MySQLNonTransientConnectionException (Too many connections).

It seems that since several nodes are attempting to connect concurrently to the database, I am exceeding MySQL's connection limit (151), and this is causing my job to run slower.

How can I limit the number of connections that the JDBC driver uses in pyspark? Any help would be great!

Try to use the numPartitions param. According to the documentation, it is the maximum number of partitions that can be used for parallelism in table reading and writing. This also determines the maximum number of concurrent JDBC connections. If the number of partitions to write exceeds this limit, Spark decreases it to this limit by calling coalesce(numPartitions) before writing. A minimal sketch of a capped read is shown below.
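
For example, here is a sketch of the same read capped at 10 concurrent connections. It assumes the table has a numeric column that can be used to split the read (called id here purely for illustration); the bounds and partition count are placeholders, not values from the original question:

dataframe_mysql = sqlctx \
    .read \
    .format("jdbc") \
    .option("url", "jdbc:mysql://<IP-ADDRESS>:3306/<DATABASE>") \
    .option("driver", "com.mysql.jdbc.Driver") \
    .option("dbtable", "<TABLE>") \
    .option("user", "<USER>") \
    .option("password", "<PASSWORD>") \
    .option("partitionColumn", "id") \
    .option("lowerBound", "1") \
    .option("upperBound", "1000000") \
    .option("numPartitions", "10") \
    .load()

With these options, the read is split into at most numPartitions tasks, so no more than that many JDBC connections should be opened against this table at once.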

I think you should reduce the default number of partitions, or reduce the number of executors.
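
A hedged sketch of that suggestion, assuming the context is created inside the application itself; the executor counts below are illustrative placeholders, not recommendations:

from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext

# Cap cluster-side parallelism so fewer tasks (and hence fewer JDBC
# connections) run at the same time; the numbers are placeholders.
conf = SparkConf() \
    .set("spark.executor.instances", "4") \
    .set("spark.executor.cores", "2")
sc = SparkContext(conf=conf)
sqlctx = SQLContext(sc)

The same settings can also be passed on the command line via spark-submit --conf if the context is created for you.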
