
Iterating a huge data frame in spark/scala

I have a DataFrame with 500 million rows. I would like to iterate through each row to modify the column names, drop a few columns, and update the column values based on a few conditions. I am using the approach below, with collect.

df.collect.foreach(row => myCustomMethod())

Because collect brings all the data to the driver, I am running into out-of-memory errors. Can you please suggest an alternative way of achieving the same result?

We are using the spark-cassandra connector by DataStax. I tried different approaches, but none of them improved the performance.

Use a map operation instead of collect/foreach, and convert the result back to a DataFrame. That allows the computation to be distributed across the cluster instead of forcing all the data onto one node. You can do this by modifying your custom method to take and return a Row; the resulting RDD of rows can then be converted back to a DataFrame.

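import org.apache.spark.sql.types.StructType

val oldSchema = originalDf.schema
// TODO: put the new schema here, based on what you want to do
val newSchema: StructType = ???
// Map over the underlying RDD[Row]; this runs on the executors, not on the driver
val newRdd = originalDf.rdd.map(row => myCustomMethod(row))
val newDf = sqlContext.createDataFrame(newRdd, newSchema)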
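To make the shape of the custom method concrete, here is a minimal sketch of the same map-then-rebuild pattern. It assumes Spark 2.x or later (hence SparkSession rather than sqlContext) and runs in local mode purely for illustration. The column names, the sample rows, and the body of myCustomMethod are all hypothetical stand-ins, not the asker's actual logic.

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

// Local session for illustration only; on a cluster the master comes from spark-submit.
val spark = SparkSession.builder().master("local[*]").appName("row-map-sketch").getOrCreate()
import spark.implicits._

// Hypothetical input standing in for the 500-million-row DataFrame.
val originalDf = Seq((1, "alice", 17), (2, "bob", 42)).toDF("id", "name", "age")

// Hypothetical custom method: takes a Row and returns a Row that matches newSchema.
// It upper-cases the name and replaces the age with an "adult"/"minor" label.
def myCustomMethod(row: Row): Row = {
  val id = row.getAs[Int]("id")
  val name = row.getAs[String]("name").toUpperCase
  val ageGroup = if (row.getAs[Int]("age") >= 18) "adult" else "minor"
  Row(id, name, ageGroup)
}

// The new schema carries the renamed and replaced columns.
val newSchema = StructType(Seq(
  StructField("id", IntegerType, nullable = false),
  StructField("full_name", StringType, nullable = true),
  StructField("age_group", StringType, nullable = true)
))

// The map runs on the executors; nothing is pulled to the driver here.
val newRdd = originalDf.rdd.map(row => myCustomMethod(row))
val newDf = spark.createDataFrame(newRdd, newSchema)
```

The values in each returned Row have to line up, in order and type, with the fields of newSchema; a mismatch only surfaces at runtime when the DataFrame is evaluated.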

Dropping columns can then be handled through the .drop method on the new DataFrame.
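As a rough illustration of .drop, the sketch below applies it together with a rename and a conditional update, using only column-level DataFrame operations. It assumes Spark 2.x or later; the column names, the sample data, and the "raise ages below 18 to 18" rule are hypothetical and only there to show the calls.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, lit, when}

// Local session for illustration only.
val spark = SparkSession.builder().master("local[*]").appName("column-ops-sketch").getOrCreate()
import spark.implicits._

// Hypothetical input; "tmp" is the column we want to get rid of.
val df = Seq((1, "alice", 17, "x"), (2, "bob", 42, "y")).toDF("id", "name", "age", "tmp")

val result = df
  .drop("tmp")                            // remove a column
  .withColumnRenamed("name", "full_name") // rename a column
  // Conditional update: ages below 18 become 18, all others are kept.
  .withColumn("age", when(col("age") < 18, lit(18)).otherwise(col("age")))
```

These calls are lazy transformations, so like the map approach they are evaluated on the executors and never collect the data on the driver.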

This may run into problems if your custom method is not serializable, or rather if it contains objects that are not serializable. In that case, switch to the mapPartitions method, so that each node creates its own copy of the relevant objects first (once per partition) instead of receiving them from the driver.
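A minimal sketch of the mapPartitions variant follows, assuming Spark 2.x or later. java.security.MessageDigest stands in for "an object that is not serializable"; the hashing itself, the column names, and the sample data are hypothetical and not part of the original question.

```scala
import java.nio.charset.StandardCharsets
import java.security.MessageDigest
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Local session for illustration only.
val spark = SparkSession.builder().master("local[*]").appName("map-partitions-sketch").getOrCreate()
import spark.implicits._

// Hypothetical input.
val originalDf = Seq("alice", "bob").toDF("name")

val newSchema = StructType(Seq(
  StructField("name", StringType, nullable = true),
  StructField("name_sha256", StringType, nullable = true)
))

val newRdd = originalDf.rdd.mapPartitions { rows =>
  // Created on the executor, once per partition, so it never has to be serialized.
  val digest = MessageDigest.getInstance("SHA-256")
  rows.map { row =>
    val name = row.getString(0)
    val hash = digest.digest(name.getBytes(StandardCharsets.UTF_8))
    // Render the hash bytes as a lowercase hex string.
    Row(name, hash.map("%02x".format(_)).mkString)
  }
}
val newDf = spark.createDataFrame(newRdd, newSchema)
```

Had the MessageDigest been created outside the closure and referenced inside a plain map, Spark would have tried to serialize it with the task and failed with a "Task not serializable" error.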

