Using Spark with Flask with JDBC
What am I doing?
I want to build an API service using Flask that extracts data from one database, does some data analysis, and then loads the new data into a separate database.
What is wrong?
If I run Spark by itself, I can access the database, perform the analysis, and load the results back into the database. But the same functions do not work when used inside a Flask application (API routes).
How am I doing it?
First I start the Spark master and worker. I can see one worker under the master at localhost:8080.
export JAVA_HOME=$(/usr/libexec/java_home -v 1.8)
../sbin/start-master.sh
../sbin/start-slave.sh spark://xxx.local:7077
For the Flask application:
from flask import Flask
from pyspark.sql import SparkSession

app = Flask(__name__)

spark = SparkSession\
    .builder\
    .appName("Flark - Flask on Spark")\
    .getOrCreate()
spark.sparkContext.setLogLevel("ERROR")

@app.route("/")
def hello():
    dataframe = spark.read.format("jdbc").options(
        url="jdbc:postgresql://localhost/foodnome_dev?user=postgres&password=''",
        database="foodnome_test",
        dbtable='"Dishes"'
    ).load()
    print([row["description"]
           for row in dataframe.select('description').collect()])
    return "hello"
To run this application, I use the JDBC driver with spark-submit:
../bin/spark-submit --master spark://Leos-MacBook-Pro.local:7077 --driver-class-path postgresql-42.2.5.jar server.py
What error do I get?
On the Flask side, the error is Internal Server Error. On the Spark side,
File "/Users/leoqiu/Desktop/spark-2.4.0-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
format(target_id, ".", name), value)
py4j.protocol.Py4JJavaError: An error occurred while calling o36.collectToPython.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3, 10.0.0.67, executor 0): java.lang.ClassNotFoundException: org.postgresql.Driver
at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at org.apache.spark.sql.execution.datasources.jdbc.DriverRegistry$.register(DriverRegistry.scala:45)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$createConnectionFactory$1.apply(JdbcUtils.scala:55)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$createConnectionFactory$1.apply(JdbcUtils.scala:54)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD.compute(JDBCRDD.scala:272)
--driver-class-path is not sufficient here. The driver jar should be added to the executor class path as well. This is typically handled together using either:

spark.jars.packages / --packages
spark.jars / --jars

though you can still use spark.executor.extraClassPath.
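For example, letting Spark resolve the driver from Maven Central distributes it to the driver and all executors in one step. This is a sketch, not taken from the question; the Maven coordinates below correspond to the same 42.2.5 driver version the question uses:

```shell
# Resolve the PostgreSQL JDBC driver from Maven Central and ship it
# to both the driver and every executor, instead of pointing at a
# local jar with --driver-class-path only.
../bin/spark-submit \
  --master spark://Leos-MacBook-Pro.local:7077 \
  --packages org.postgresql:postgresql:42.2.5 \
  server.py
```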
Explanation:
With a JDBC source, the driver is responsible for reading the metadata (schema), while the executors perform the actual data retrieval.
This behavior is common to different external data sources, so whenever you use a non-built-in format, you should distribute the corresponding jars across the cluster.
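A related tweak, as an aside rather than part of the accepted fix: passing the credentials and the driver class as separate JDBC options, instead of embedding them in the URL, makes it explicit which class Spark must load on each executor. A minimal sketch of such an options mapping (values mirror the question's snippet):

```python
# Sketch of JDBC options with the driver class named explicitly.
# "driver" is the class every executor must be able to load, which is
# why the jar has to be distributed across the cluster.
jdbc_options = {
    "url": "jdbc:postgresql://localhost/foodnome_dev",
    "user": "postgres",
    "password": "",
    "dbtable": '"Dishes"',
    "driver": "org.postgresql.Driver",
}

# With a SparkSession in scope this would be used as:
# dataframe = spark.read.format("jdbc").options(**jdbc_options).load()
print(jdbc_options["driver"])  # → org.postgresql.Driver
```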
See also
How to use JDBC source to write and read data in (Py)Spark?
Here is what worked for me, as suggested. It needs --jars:
../bin/spark-submit --master spark://Leos-MacBook-Pro.local:7077 --driver-class-path postgresql-42.2.5.jar --jars postgresql-42.2.5.jar server.py