I am new to the Spark world. I am testing Spark on my local machine with pyspark. I have created the following script, but when it reaches the rdd.collect() method, it simply gets stuck.
from pyspark.sql import SparkSession

sparkSession = SparkSession.builder.appName("SimpleApp") \
    .getOrCreate()

_data_frame_reader_ = sparkSession.read.format("jdbc").option("url", url) \
    .option("user", user) \
    .option("password", password) \
    .option("driver", "oracle.jdbc.driver.OracleDriver")

mytable = _data_frame_reader_.option("dbtable", 'my_test_table').load()
mytable.registerTempTable("my_test_table")

sql = 'SELECT * from my_test_table'
df = sparkSession.sql(sql)

for row in df.rdd.collect():
    # do some operation
    pass
My table has only about 50 records. I am able to connect to my database through SQL Developer.
I am executing this code through a Jupyter notebook. It logs no error; it simply stays executing forever.
I could not figure out what is going on.
Thank you for your time!
I figured out what was happening. My table has only 50 records, but it has FKs to other tables that have a lot of rows. I let the job run for more than 30 minutes and it didn't finish. I did the following:
1 - Added a fetch size in the DB configuration:
_data_frame_reader_ = sparkSession.read.format("jdbc").option("url", url) \
    .option("user", user) \
    .option("password", password) \
    .option("fetchsize", "10000") \
    .option("driver", "oracle.jdbc.driver.OracleDriver")
This will increase the load performance. See this documentation.
2 - I tuned my queries to fetch only the records I need, adding some joins and WHERE clauses on the other tables to filter the dependent rows too.
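Step 2 can be sketched like this. One way to make Oracle do the filtering (instead of Spark fetching whole tables) is to pass a parenthesized subquery as the dbtable option; the table and column names below (dependent_table, parent_id, status) are hypothetical examples, not from my actual schema:

```python
# Sketch: push the filter down to the database via the JDBC "dbtable" option.
# Oracle accepts a parenthesized subquery (with an alias) anywhere a table
# name is valid, so the reader only fetches the rows the query returns.

def pushdown_subquery(table, where):
    # Build a subquery string suitable for the "dbtable" option.
    # Note: Oracle table aliases are written without the AS keyword.
    return "(SELECT * FROM {} WHERE {}) t".format(table, where)

subquery = pushdown_subquery(
    "my_test_table p",
    "EXISTS (SELECT 1 FROM dependent_table d"
    " WHERE d.parent_id = p.id AND d.status = 'ACTIVE')")

# The subquery then replaces the bare table name in the reader:
# mytable = _data_frame_reader_.option("dbtable", subquery).load()
```

This keeps the join and filtering on the Oracle side, so Spark never pulls the large dependent tables across JDBC.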
Now my job runs in less than 2 minutes.