I would like to understand the difference between RAM and storage in Azure Databricks.
Suppose I am reading csv data from the Azure data lake (ADLS Gen 2) as follows:
df = spark.read.csv("path to the csv file").collect()
I am aware that the read method in Spark is a transformation, and that it is not going to run immediately. However, if I now perform an action using the collect() method, I would assume that the data has actually been read from the data lake by Spark and loaded into RAM or onto disk. First, I would like to know: where is the data stored, in RAM or on disk? And if the data is stored in RAM, then what is cache used for? And if the data is retrieved and stored on disk, then what does persist do? I am aware that cache stores the data in memory for later use, and that if I have a very large amount of data, I can use persist to store the data on disk.
I would also like to know how much Databricks can scale if we have petabytes of data, and how RAM and disk differ in size. Please note that I am a newbie to Azure Databricks and Spark.
I would like to get some recommendations on best practices when using Spark.
Your help is much appreciated!!
First, I would like to know, where is the data stored.
When you run any action (i.e. collect or others), data is collected from the executor nodes to the driver node and stored in RAM (memory).
And, if the data is stored in RAM, then what is cache used for?
Spark uses lazy evaluation. What that means is that until you call an action it doesn't do anything; once you call one, it creates a DAG and then executes that DAG.
Let's understand it with an example. Consider that you have three tables: Table A, Table B and Table C. You join these tables and apply some business logic (maps and filters); let's call the resulting dataframe filtered_data. Now you use this DataFrame in, say, 5 different places (other dataframes), for lookups, joins or other business reasons.
If you don't persist your filtered_data dataframe, every time it is referenced it will be recomputed: it will again go through the joins and the other business logic. So it's advisable to persist a dataframe if you are going to use it in multiple places.
By default, cache stores data in memory (RAM), but with persist you can set the storage level to disk.
I would like to know, how much can Databricks scale if we have petabytes of data?
It's a distributed environment, so what you need to do is add more executors, and maybe increase the memory and CPU configuration.
how can I know where the data is stored at any point in time?
If you haven't created a table or view, it's stored in memory.
What is the underlying operating system running Azure Databricks?
It uses a Linux operating system; specifically Linux-4.15.0-1050-azure-x86_64-with-Ubuntu-16.04-xenial.
You can run the following commands to find out:
import platform
print(platform.platform())