I am new to the Spark world. I am testing Spark on my local machine with pyspark. I have created the following script, but when it reaches the rdd.collect() method, it simply gets stuck.
from pyspark.sql import SparkSession

sparkSession = SparkSession.builder.appName("SimpleApp") \
    .getOrCreate()

_data_frame_reader_ = sparkSession.read.format("jdbc").option("url", url) \
    .option("user", user) \
    .option("password", password) \
    .option("driver", "oracle.jdbc.driver.OracleDriver")

mytable = _data_frame_reader_.option("dbtable", 'my_test_table').load()
mytable.registerTempTable("my_test_table")

sql = 'SELECT * from my_test_table'
df = sparkSession.sql(sql)

for row in df.rdd.collect():
    # do some operation
    pass
My table has only about 50 records. I am able to connect to my database through SQL Developer.
I am executing this code through a Jupyter notebook. It logs no error; it simply stays executing forever.
I could not figure out what is going on.
Thank you for your time!
I figured out what was happening. My table has only 50 records, but it has FKs to other tables that have a lot of rows. I let the job run for more than 30 minutes and it didn't finish. I did the following:
1 - Added a fetch size in the DB configuration:
_data_frame_reader_ = sparkSession.read.format("jdbc").option("url", url) \
    .option("user", user) \
    .option("password", password) \
    .option("fetchsize", "10000") \
    .option("driver", "oracle.jdbc.driver.OracleDriver")
This will increase the load performance. See this documentation.
2 - I tuned my queries to fetch only the records I need, adding some joins and WHERE clauses on the other tables to filter the dependent rows too.
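Step 2 can be sketched like this. One way to make Oracle do the filtering (instead of Spark fetching whole tables) is to pass a parenthesized subquery as the dbtable option; the table and column names below (dependent_table, parent_id, status) are hypothetical examples, not from my actual schema:

```python
# Sketch: push the filter down to the database via the JDBC "dbtable" option.
# Oracle accepts a parenthesized subquery (with an alias) anywhere a table
# name is valid, so the reader only fetches the rows the query returns.

def pushdown_subquery(table, where):
    # Build a subquery string suitable for the "dbtable" option.
    # Note: Oracle table aliases are written without the AS keyword.
    return "(SELECT * FROM {} WHERE {}) t".format(table, where)

subquery = pushdown_subquery(
    "my_test_table p",
    "EXISTS (SELECT 1 FROM dependent_table d"
    " WHERE d.parent_id = p.id AND d.status = 'ACTIVE')")

# The subquery then replaces the bare table name in the reader:
# mytable = _data_frame_reader_.option("dbtable", subquery).load()
```

This keeps the join and filtering on the Oracle side, so Spark never pulls the large dependent tables across JDBC.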
Now my job runs in less than 2 minutes.