Error while installing Spark on Google Colab

I am getting an error while installing Spark on Google Colab. It says:

tar: spark-2.2.1-bin-hadoop2.7.tgz: Cannot open: No such file or directory
tar: Error is not recoverable: exiting now

These were my steps

The problem is caused by the download link you are using for Spark.


To download Spark without problems, get it from the Apache archive site ( https://archive.apache.org/dist/spark ).

For example, the archive link used in the code below works fine.


Here is the complete code to install and set up Java, Spark and PySpark:

# install java
!apt-get install openjdk-8-jdk-headless -qq > /dev/null

# install spark (change the version number if needed)
!wget -q https://archive.apache.org/dist/spark/spark-3.0.0/spark-3.0.0-bin-hadoop3.2.tgz

# unzip the spark file to the current folder
!tar xf spark-3.0.0-bin-hadoop3.2.tgz

# point the JAVA_HOME and SPARK_HOME environment variables at the installed folders
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.0.0-bin-hadoop3.2"

# install findspark using pip
!pip install -q findspark

Python users should also install pyspark with the following command:

!pip install pyspark
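
Since the version number appears in the download URL, the archive name and SPARK_HOME, it can help to derive all three from one place. The following is only a minimal sketch: the helper name `spark_paths` and the `/content` default are assumptions for illustration, and the URL pattern simply mirrors the archive link used in the code above without checking it against the server.

```python
def spark_paths(spark_version, hadoop_version, base_dir="/content"):
    """Return (download_url, archive_name, spark_home) for a Spark release.

    Hypothetical helper: it only formats strings and does not verify
    that the release actually exists on archive.apache.org.
    """
    folder = f"spark-{spark_version}-bin-hadoop{hadoop_version}"
    archive_name = folder + ".tgz"
    url = (
        "https://archive.apache.org/dist/spark/"
        f"spark-{spark_version}/{archive_name}"
    )
    spark_home = f"{base_dir}/{folder}"
    return url, archive_name, spark_home


# Example: the same release as in the answer above.
url, archive_name, spark_home = spark_paths("3.0.0", "3.2")
print(url)
print(archive_name)
print(spark_home)
```

Changing the two version arguments then keeps the wget URL, the tar file name and SPARK_HOME consistent with each other.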

This error is caused by the link used in the second line of your code. The following snippet worked for me on Google Colab. Remember to change the Spark version to the latest one and update the SPARK_HOME path accordingly. You can find the latest versions here: https://downloads.apache.org/spark/

!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q https://downloads.apache.org/spark/spark-3.0.0-preview2/spark-3.0.0-preview2-bin-hadoop2.7.tgz
!tar -xvf spark-3.0.0-preview2-bin-hadoop2.7.tgz
!pip install -q findspark
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.0.0-preview2-bin-hadoop2.7"
import findspark
findspark.init()  # make the pyspark bundled in SPARK_HOME importable

This is the correct code. I just tested it.

!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q http://mirrors.viethosting.com/apache/spark/spark-2.4.4/spark-2.4.4-bin-hadoop2.7.tgz
!tar xf spark-2.4.4-bin-hadoop2.7.tgz
!pip install -q findspark
# for the most recent update on 02/29/2020

!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q https://www-us.apache.org/dist/spark/spark-3.0.0-preview2/spark-3.0.0-preview2-bin-hadoop3.2.tgz
!tar -xvf spark-3.0.0-preview2-bin-hadoop3.2.tgz
!pip install -q findspark
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.0.0-preview2-bin-hadoop3.2"

Go to https://downloads.apache.org/spark/, choose the version you need from the folders, and follow the instructions in https://colab.research.google.com/github/asifahmed90/pyspark-ML-in-Colab/blob/master/PySpark_Regression_Analysis.ipynb#scrollTo=m606eNuQgA82


  1. Go to https://downloads.apache.org/spark/
  2. Select a folder, for example "spark-3.0.1/"
  3. Copy the name of the file you want, for example "spark-3.0.1-bin-hadoop3.2.tgz" (it ends with .tgz)
  4. Paste it into the script below

!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget https://downloads.apache.org/spark/FOLDER_YOU_CHOSE/FILE_YOU_CHOSE
!tar -xvf FILE_YOU_CHOSE
!pip install -q findspark

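import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
# SPARK_HOME is the extracted folder: FILE_YOU_CHOSE without the ".tgz" suffix
os.environ["SPARK_HOME"] = "/content/FILE_YOU_CHOSE_WITHOUT_TGZ"

import findspark
findspark.init()  # make the pyspark bundled in SPARK_HOME importable
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[*]").getOrCreate()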
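
One detail in the steps above is easy to get wrong: the file you copy ends with .tgz, while the Spark binary archives extract into a folder of the same name without that suffix, and SPARK_HOME has to point at the folder. A small sketch of that mapping follows; the function name `spark_home_from_archive` and the `/content` default are assumptions for illustration.

```python
import posixpath


def spark_home_from_archive(file_name, base_dir="/content"):
    """Map an archive name like 'spark-3.0.1-bin-hadoop3.2.tgz' to the
    folder it is expected to extract into under base_dir.

    Hypothetical helper: it assumes the archive unpacks into a folder
    named after the file minus its '.tgz' suffix.
    """
    suffix = ".tgz"
    if not file_name.endswith(suffix):
        raise ValueError(f"expected a {suffix} file name, got {file_name!r}")
    folder = file_name[: -len(suffix)]
    return posixpath.join(base_dir, folder)


print(spark_home_from_archive("spark-3.0.1-bin-hadoop3.2.tgz"))
```

The returned string is what would go into os.environ["SPARK_HOME"].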

I have tried the following commands and they seem to work.

!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget https://downloads.apache.org/spark/spark-3.0.0-preview2/spark-3.0.0-preview2-bin-hadoop3.2.tgz
!tar -xvf spark-3.0.0-preview2-bin-hadoop3.2.tgz
!pip install -q findspark

I got the latest version, changed the download URL, and added the v flag to the tar command for verbose output.

You are using a link for an old version. The following commands use a newer version and will work:

!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q http://apache.osuosl.org/spark/spark-2.4.0/spark-2.4.0-bin-hadoop2.7.tgz
!tar xf spark-2.4.0-bin-hadoop2.7.tgz
!pip install -q findspark
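
Because links for older versions can stop working, it may be worth checking that a URL still responds before handing it to wget. The sketch below uses only the standard library; the helper names `url_exists` and `first_available` are assumptions for illustration, and a server that rejects HEAD requests would be reported as unavailable even if a normal download would succeed.

```python
import urllib.request


def url_exists(url, timeout=10):
    """Return True if a HEAD request to url succeeds with a status below 400."""
    try:
        request = urllib.request.Request(url, method="HEAD")
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status < 400
    except (ValueError, OSError):
        # ValueError: malformed URL; OSError covers URLError, HTTPError, timeouts.
        return False


def first_available(urls, probe=url_exists):
    """Return the first URL for which probe(url) is true, or None if none is."""
    for url in urls:
        if probe(url):
            return url
    return None
```

Calling first_available with a list of candidate download links returns the first one that answers, which can then be passed to wget in the notebook.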

To run Spark in Colab, we first need to install the dependencies in the Colab environment: Apache Spark 2.4.3 with Hadoop 2.7, Java 8, and findspark, which locates Spark on the system. The installation can be carried out inside the Colab notebook itself.

!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q https://www-us.apache.org/dist/spark/spark-2.4.3/spark-2.4.3-bin-hadoop2.7.tgz
!tar xf spark-2.4.3-bin-hadoop2.7.tgz
!pip install -q findspark

If you get this error again (Cannot open: No such file or directory), visit one of the Apache Spark download sites and look up the latest build version:

  1. https://www-us.apache.org/dist/spark/
  2. http://apache.osuosl.org/spark/

Then replace spark-2.4.3 in the commands above with the latest version.
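
With wget -q a failed download prints nothing, so tar's "Cannot open" message is often the first visible symptom. As a rough sketch, the archive can be inspected from Python before extracting it; the function name `check_archive` and its return strings are assumptions for illustration.

```python
import os
import tarfile


def check_archive(path):
    """Return a short diagnosis of a downloaded archive before running tar."""
    if not os.path.exists(path):
        return "missing"  # the download never produced a file
    if os.path.getsize(path) == 0:
        return "empty"  # a file was created but nothing was written to it
    if not tarfile.is_tarfile(path):
        return "not a tar archive"  # e.g. an HTML error page saved as .tgz
    return "ok"


print(check_archive("spark-2.4.3-bin-hadoop2.7.tgz"))
```

A result other than "ok" points at the download step rather than at tar.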

Spark version 2.3.2 works very well in Google Colab. Just follow these steps:

!pip install pyspark==2.3.2
import pyspark 

Check the version we have installed


Try to create a SparkSession:

from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("Sparkify").getOrCreate()

You can now use Spark in Colab.
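
One way to confirm which pyspark version pip actually installed is to ask the package metadata. This is only a sketch and not necessarily the check this answer originally showed; the helper name `installed_version` is an assumption for illustration, and it relies on importlib.metadata, which is available from Python 3.8.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of package, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Prints the pyspark version, or None if pyspark is not installed.
print(installed_version("pyspark"))
```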


!pip install pyspark

It worked for me with just the !pip install pyspark command.


