
From Spark to pandas DataFrame iteratively

I have data in Spark and want to transform it into a pandas DataFrame for further analysis. I'm doing just this:

dataset = sqlContext.sql('SELECT * FROM TEMP')

df = dataset.toPandas()

But the table is quite big, and a lot of time is spent on the pandas conversion.

Does the toPandas() function have parameters such as an iterator or a chunk size (like read_csv in pandas) so that the data can be transferred iteratively to improve performance?

Thanks!

There are no options for the toPandas() method itself. Take a look at the source for the function here.

As the commenters have mentioned (and as the method's docstring points out), you are at risk of being physically unable to do this: toPandas() collects the entire table into the driver's memory. And I'm not sure what you could do in pandas that you couldn't figure out how to do in Spark.

If you really do want to process chunks of Spark data in Python, your best bet is to write the data out to the filesystem as CSV and then read it back in chunks.
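For illustration, here is a minimal sketch of that round trip. The output path, the chunk size, and the process() helper are assumptions for the example, and it presumes the CSV part files land on a filesystem pandas can read directly (not, say, HDFS):

import glob
import pandas as pd

# Hypothetical output directory; adjust for your environment.
out_dir = '/tmp/temp_table'

# Write the Spark result out as CSV part files.
dataset = sqlContext.sql('SELECT * FROM TEMP')
dataset.write.csv(out_dir, header=True, mode='overwrite')

# Read each part file back into pandas in fixed-size chunks.
for path in glob.glob(out_dir + '/part-*.csv'):
    for chunk in pd.read_csv(path, chunksize=100000):
        process(chunk)  # process() is a placeholder for your per-chunk analysis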
