
From Spark to pandas DataFrame iteratively

I have data in Spark and want to transform it into a pandas DataFrame for further analysis. I'm doing just this:

dataset = sqlContext.sql('SELECT * FROM TEMP')

df = dataset.toPandas()

But the table is quite big, and a lot of time is spent on the pandas conversion.

Does the toPandas() function have parameters such as an iterator or a chunk size (like read_csv in pandas) so that the data can be transferred iteratively to improve performance?

Thanks!

There are no options for the toPandas() method itself. Take a look at the source for the function here.

As the commenters have mentioned (and as the method's docstring points out), you are at risk of being physically unable to do this: toPandas() collects the entire table into the driver's memory. And I'm not sure what you could do in pandas that you couldn't figure out how to do in Spark.

If you really do want to process chunks of Spark data in Python, your best bet is to write the data out to the filesystem as CSV and then read it back in chunks.
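For illustration, here is a minimal sketch of that round trip. The output path, the chunk size, and the process() helper are assumptions for the example, and it presumes the CSV part files land on a filesystem pandas can read directly (not, say, HDFS):

import glob
import pandas as pd

# Hypothetical output directory; adjust for your environment.
out_dir = '/tmp/temp_table'

# Write the Spark result out as CSV part files.
dataset = sqlContext.sql('SELECT * FROM TEMP')
dataset.write.csv(out_dir, header=True, mode='overwrite')

# Read each part file back into pandas in fixed-size chunks.
for path in glob.glob(out_dir + '/part-*.csv'):
    for chunk in pd.read_csv(path, chunksize=100000):
        process(chunk)  # process() is a placeholder for your per-chunk analysis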
