
Does a count() over a DataFrame materialize the data to the driver / increase a risk of OOM?

I want to run df.count() on my DataFrame, but I know my total dataset size is pretty large. Does this run the risk of materializing the data back to the driver, increasing my risk of driver OOM?

This will not materialize your entire dataset to the driver, nor will it inherently increase your risk of OOM. It does force evaluation of the incoming DataFrame, so if that evaluation would OOM anyway, the failure will surface at the point you call .count() — but the .count() itself didn't cause the failure; it only made you observe it. Only a single long (the row count) is sent back to the driver.

What .count() will do, however, is halt the execution of your job at the point you make the call. Because the value must be known to the driver before any subsequent work can proceed, it acts as a synchronization barrier and is not a particularly efficient use of Spark / distributed compute. Use .count() only when necessary, i.e. when making choices about partition counts or other such dynamic sizing operations.

