
From Spark to Pandas Dataframe iteratively

I have data in Spark and want to transform it into a Pandas dataframe for further analysis. I'm doing just this:

dataset = sqlContext.sql('SELECT * FROM TEMP')
df = dataset.toPandas()

But the table seems to be quite big, and the conversion to Pandas takes a lot of time.

Does the toPandas() function have options such as an iterator or a chunk size (like read_csv in pandas) so the data can be transferred iteratively for better performance?
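For reference, this is the kind of chunked reading I mean in pandas (the file name and chunk size here are just illustrative):

import pandas as pd

# chunksize makes read_csv yield DataFrames of at most that many rows,
# instead of loading the whole file into memory at once
for chunk in pd.read_csv('big_table.csv', chunksize=100000):
    print(chunk.shape)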

Thanks!

There are no options for the toPandas() method itself. Take a look at the source for the function here.

As the commenters have mentioned (and as is pointed out in the docstring for the method), you are at risk of being physically unable to do this, since the whole table has to fit in the driver's memory, and I'm not sure what you could do with pandas that you couldn't figure out how to do in Spark.

If you really do want to process chunks of the Spark data in Python, then your best bet is to write the data out to the filesystem as CSV and then read it back in chunks.
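A minimal sketch of that approach, assuming the built-in CSV writer available in Spark 2.x; the output directory, the part-file glob pattern, the chunk size, and the process() placeholder are all illustrative:

import glob
import pandas as pd

# Have Spark write the table out as CSV part files under one directory;
# header=True preserves the column names for pandas.
dataset = sqlContext.sql('SELECT * FROM TEMP')
dataset.write.csv('/tmp/temp_export', header=True)

# Read each part file back in fixed-size chunks instead of all at once.
for path in glob.glob('/tmp/temp_export/part-*.csv'):
    for chunk in pd.read_csv(path, chunksize=100000):
        process(chunk)  # hypothetical: replace with your own pandas analysis

This way only one chunk of rows lives in the Python process at a time, rather than materializing the whole table on the driver the way toPandas() does.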
