
Using Spark with Flask with JDBC

What am I doing?

I want to build an API service using Flask that extracts data from one database, performs some data analysis, and then loads the new data into a separate database.

What is wrong?

If I run Spark by itself, I can access the database, perform the analysis, and load the results into the database. But the same functions do not work when I use them inside a Flask application (in API routes).

How am I doing it?

First I start the Spark master and worker. I can see one worker under the master at localhost:8080.

export JAVA_HOME=$(/usr/libexec/java_home -v 1.8)

../sbin/start-master.sh
../sbin/start-slave.sh spark://xxx.local:7077

For the Flask application:

from flask import Flask
from pyspark.sql import SparkSession

app = Flask(__name__)

spark = SparkSession\
    .builder\
    .appName("Flark - Flask on Spark")\
    .getOrCreate()
spark.sparkContext.setLogLevel("ERROR")


@app.route("/")
def hello():
    dataframe = spark.read.format("jdbc").options(
        url="jdbc:postgresql://localhost/foodnome_dev?user=postgres&password=''",
        database="foodnome_test",
        dbtable='"Dishes"'
    ).load()

    print([row["description"]
           for row in dataframe.select('description').collect()])

    return "hello"

To run this application, I use spark-submit with the JDBC driver:

../bin/spark-submit --master spark://Leos-MacBook-Pro.local:7077 --driver-class-path postgresql-42.2.5.jar server.py

What error do I get?

On the Flask side, the error is Internal Server Error. On the Spark side:

File "/Users/leoqiu/Desktop/spark-2.4.0-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
    format(target_id, ".", name), value)
py4j.protocol.Py4JJavaError: An error occurred while calling o36.collectToPython.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3, 10.0.0.67, executor 0): java.lang.ClassNotFoundException: org.postgresql.Driver
    at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
    at org.apache.spark.sql.execution.datasources.jdbc.DriverRegistry$.register(DriverRegistry.scala:45)
    at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$createConnectionFactory$1.apply(JdbcUtils.scala:55)
    at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$createConnectionFactory$1.apply(JdbcUtils.scala:54)
    at org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD.compute(JDBCRDD.scala:272)

--driver-class-path is not sufficient here. The driver jar should be added to the executor class path as well. This is typically handled together using either of:

  • spark.jars.packages / --packages
  • spark.jars / --jars

though you can still use spark.executor.extraClassPath.
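For example, the driver can be resolved from Maven Central with --packages instead of shipping the jar file by hand; a minimal sketch, assuming the same master URL and driver version as in the question:

```shell
# --packages fetches the PostgreSQL JDBC driver from Maven Central
# and distributes it to the driver and every executor.
../bin/spark-submit \
  --master spark://Leos-MacBook-Pro.local:7077 \
  --packages org.postgresql:postgresql:42.2.5 \
  server.py
```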

Explanation:

With a JDBC source, the driver is responsible for reading the metadata (the schema), while the executors handle the actual data retrieval.

This behavior is common to different external data sources, so whenever you use a non-built-in format, you should distribute the corresponding jars across the cluster.
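The jar distribution can also be expressed in the session builder itself rather than on the spark-submit command line; a minimal sketch, assuming the jar path resolves on the machine running the driver:

```python
from pyspark.sql import SparkSession

# spark.jars ships the listed jars to the driver and every executor,
# so org.postgresql.Driver is loadable on both sides of the job.
spark = SparkSession \
    .builder \
    .appName("Flark - Flask on Spark") \
    .config("spark.jars", "postgresql-42.2.5.jar") \
    .getOrCreate()
```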

See also

How to use JDBC source to write and read data in (Py)Spark?

Here is what worked for me, as suggested. It needs --jars:

../bin/spark-submit --master spark://Leos-MacBook-Pro.local:7077 --driver-class-path postgresql-42.2.5.jar --jars postgresql-42.2.5.jar server.py
