Importing data from GitHub using Python in a Jupyter notebook

Question

I'm using the book "Hands-On Machine Learning with Scikit-Learn and TensorFlow" by Aurélien Géron.

It's my first time using Jupyter and Python.

I'm trying to run the code from the book, but I have a problem when I run the cell containing this code:

import os
import tarfile
import urllib
DOWNLOAD_ROOT = "https://raw.githubusercontent.com/ageron/handson-ml2/master/"
HOUSING_PATH = os.path.join("datasets", "housing")
HOUSING_URL = DOWNLOAD_ROOT + "datasets/housing/housing.tgz"
def fetch_housing_data(housing_url=HOUSING_URL, housing_path=HOUSING_PATH):
    os.makedirs(housing_path, exist_ok=True)
    tgz_path = os.path.join(housing_path, "housing.tgz")
    urllib.request.urlretrieve(housing_url, tgz_path)
    housing_tgz = tarfile.open(tgz_path)
    housing_tgz.extractall(path=housing_path)
    housing_tgz.close()

The cell never finishes evaluating: In[*]: never changes to something like In[1]:.

I thought the problem was the initial URL, because my browser showed an error when I visited it.

So I changed it to DOWNLOAD_ROOT = "https://github.com/ageron/handson-ml2/tree/master/".

Now the cell finishes and I get In[1]:. However, when I run fetch_housing_data(), I get:

---------------------------------------------------------------------------
ReadError                                 Traceback (most recent call last)
<ipython-input-6-bd66b1fe6daf> in <module>
----> 1 fetch_housing_data()

<ipython-input-5-ef3c39b342d8> in fetch_housing_data(housing_url, housing_path)
      9     tgz_path = os.path.join(housing_path, "housing.tgz")
     10     urllib.request.urlretrieve(housing_url, tgz_path)
---> 11     housing_tgz = tarfile.open(tgz_path)
     12     housing_tgz.extractall(path=housing_path)
     13     housing_tgz.close()

~\Anaconda3\lib\tarfile.py in open(cls, name, mode, fileobj, bufsize, **kwargs)
   1576                         fileobj.seek(saved_pos)
   1577                     continue
-> 1578             raise ReadError("file could not be opened successfully")
   1579 
   1580         elif ":" in mode:

ReadError: file could not be opened successfully

Why does this happen, and how can I solve this?

Answer 1

Have you restarted your kernel and tried running it again?
What you are seeing isn't reproducible.

The first code block you pasted works as written; there is no need to modify it.
I ran the code below in one cell, then ran fetch_housing_data() in another cell, and it worked:

import os
import tarfile
import urllib
DOWNLOAD_ROOT = "https://raw.githubusercontent.com/ageron/handson-ml2/master/"
HOUSING_PATH = os.path.join("datasets", "housing")
HOUSING_URL = DOWNLOAD_ROOT + "datasets/housing/housing.tgz"
def fetch_housing_data(housing_url=HOUSING_URL, housing_path=HOUSING_PATH):
    os.makedirs(housing_path, exist_ok=True)
    tgz_path = os.path.join(housing_path, "housing.tgz")
    urllib.request.urlretrieve(housing_url, tgz_path)
    housing_tgz = tarfile.open(tgz_path)
    housing_tgz.extractall(path=housing_path)
    housing_tgz.close()

Are you sure it isn't just a display artifact, where the cell completes but you don't see it?
If you want to verify independently, you can run it elsewhere as I did. I tested it by going here and pressing the launch binder link at the bottom. Then I pasted your code into a cell in the notebook that comes up. After running those two cells, I have a directory at /home/jovyan/scripts/datasets/housing containing housing.csv and housing.tgz.
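One way to check the result without relying on the cell counter is to list what ended up in the download directory. The sketch below assumes the default HOUSING_PATH from the question; missing_housing_files is a hypothetical helper name used for illustration, not something from the book.

```python
import os

HOUSING_PATH = os.path.join("datasets", "housing")  # same default as in the question


def missing_housing_files(housing_path=HOUSING_PATH,
                          expected=("housing.csv", "housing.tgz")):
    """Return the expected file names that are not present in housing_path."""
    if not os.path.isdir(housing_path):
        # The directory was never created, so nothing was downloaded.
        return list(expected)
    present = set(os.listdir(housing_path))
    return [name for name in expected if name not in present]
```

Running missing_housing_files() in a new cell after fetch_housing_data() should return an empty list if both the archive and the extracted CSV are there.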

Answer 2

https://raw.githubusercontent.com/ageron/handson-ml2/master/

I am not sure what kind of link this is; maybe someone can explain. When I type it into a browser, I cannot access the page. However, it does work for retrieving the data, as I explain in the next paragraph. If I use the regular GitHub link https://github.com/ageron/handson-ml2/tree/master/ that you mentioned above, the code is unable to extract the data.
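A likely explanation, offered as a sketch rather than a definitive diagnosis: raw.githubusercontent.com serves the raw contents of individual files, so the bare root URL has no file to show in a browser, while a github.com/.../tree/... address returns an HTML page. If that HTML gets saved as housing.tgz, tarfile cannot read it. The example below reproduces this locally without any network access; the HTML text and the file names under the temporary directory are made up for illustration.

```python
import os
import tarfile
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    # Stand-in for what a github.com/.../tree/... URL returns: an HTML page
    # saved under the name housing.tgz (this HTML text is invented).
    html_path = os.path.join(tmp, "housing.tgz")
    with open(html_path, "w", encoding="utf-8") as f:
        f.write("<!DOCTYPE html><html><body>repository page</body></html>")

    # A genuine gzip-compressed tar archive for comparison.
    csv_path = os.path.join(tmp, "housing.csv")
    with open(csv_path, "w", encoding="utf-8") as f:
        f.write("longitude,latitude\n")
    real_path = os.path.join(tmp, "real.tgz")
    with tarfile.open(real_path, "w:gz") as tar:
        tar.add(csv_path, arcname="housing.csv")

    # is_tarfile() reports whether tarfile can read the file as an archive.
    html_is_tar = tarfile.is_tarfile(html_path)
    real_is_tar = tarfile.is_tarfile(real_path)

    # Opening the HTML file directly raises the error seen in the question.
    try:
        tarfile.open(html_path)
        html_error = None
    except tarfile.ReadError as exc:
        html_error = type(exc).__name__

print(html_is_tar, real_is_tar, html_error)
```

The HTML file fails the check and raises ReadError when opened, which matches the traceback in the question, while the real archive passes.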

I was able to extract the CSV file by following the steps in the book and adding one more line to the imports: import urllib.request. This works for me on Google Colab. You would think that importing urllib also imports urllib.request, but that's not the case. I cannot say why this works, but the urllib documentation uses import urllib.request in one of its examples, and I took the idea from there.
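For reference, here is a minimal sketch of the download function with the explicit import. It also swaps urlretrieve, which takes no timeout argument, for urllib.request.urlopen with a timeout, so a stalled connection should raise an error instead of leaving the cell at In[*]. The name fetch_housing_data_checked, the timeout parameter, and the is_tarfile check are additions for illustration and are not from the book.

```python
import os
import shutil
import tarfile
import urllib.request  # explicit: `import urllib` alone does not load this submodule

DOWNLOAD_ROOT = "https://raw.githubusercontent.com/ageron/handson-ml2/master/"
HOUSING_PATH = os.path.join("datasets", "housing")
HOUSING_URL = DOWNLOAD_ROOT + "datasets/housing/housing.tgz"


def fetch_housing_data_checked(housing_url=HOUSING_URL,
                               housing_path=HOUSING_PATH,
                               timeout=30):
    """Download the archive, verify it is a tar file, then extract it."""
    os.makedirs(housing_path, exist_ok=True)
    tgz_path = os.path.join(housing_path, "housing.tgz")

    # The timeout applies to blocking network operations such as connecting,
    # not to the total download time.
    with urllib.request.urlopen(housing_url, timeout=timeout) as response, \
            open(tgz_path, "wb") as out_file:
        shutil.copyfileobj(response, out_file)

    # Fail with a readable message if the server sent something else,
    # for example an HTML page.
    if not tarfile.is_tarfile(tgz_path):
        raise ValueError(f"{housing_url} did not return a tar archive")

    with tarfile.open(tgz_path) as housing_tgz:
        housing_tgz.extractall(path=housing_path)
    return housing_path
```

Calling fetch_housing_data_checked() with no arguments uses the same URL and directory as the question. In a notebook, other libraries have often imported urllib.request already, which may be why plain import urllib appears to work in some sessions and not in others.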

2 answers

solution1
0 ACCEPTED 2020-02-11 18:20:12

solution2
-2 2021-07-24 00:32:48
