
Importing data from GitHub using Python in a Jupyter notebook

I'm using the book "Hands-On Machine Learning with Scikit-Learn and TensorFlow" by Aurélien Géron.

It's my first time using Jupyter and Python.

I'm trying to follow the code from the book shown below.

My problem is when I run the cell with this code:

import os
import tarfile
import urllib
DOWNLOAD_ROOT = "https://raw.githubusercontent.com/ageron/handson-ml2/master/"
HOUSING_PATH = os.path.join("datasets", "housing")
HOUSING_URL = DOWNLOAD_ROOT + "datasets/housing/housing.tgz"
def fetch_housing_data(housing_url=HOUSING_URL, housing_path=HOUSING_PATH):
    os.makedirs(housing_path, exist_ok=True)
    tgz_path = os.path.join(housing_path, "housing.tgz")
    urllib.request.urlretrieve(housing_url, tgz_path)
    housing_tgz = tarfile.open(tgz_path)
    housing_tgz.extractall(path=housing_path)
    housing_tgz.close()

The cell never finishes evaluating: the prompt stays at In[*]: and never becomes something like In[1]:.

I thought the problem was the initial URL, because my browser showed an error when I visited it.

So I changed it to DOWNLOAD_ROOT = "https://github.com/ageron/handson-ml2/tree/master/".

Now the cell completes and I get In[1]:. However, when I run fetch_housing_data(), I get:

---------------------------------------------------------------------------
ReadError                                 Traceback (most recent call last)
<ipython-input-6-bd66b1fe6daf> in <module>
----> 1 fetch_housing_data()

<ipython-input-5-ef3c39b342d8> in fetch_housing_data(housing_url, housing_path)
      9     tgz_path = os.path.join(housing_path, "housing.tgz")
     10     urllib.request.urlretrieve(housing_url, tgz_path)
---> 11     housing_tgz = tarfile.open(tgz_path)
     12     housing_tgz.extractall(path=housing_path)
     13     housing_tgz.close()

~\Anaconda3\lib\tarfile.py in open(cls, name, mode, fileobj, bufsize, **kwargs)
   1576                         fileobj.seek(saved_pos)
   1577                     continue
-> 1578             raise ReadError("file could not be opened successfully")
   1579 
   1580         elif ":" in mode:

ReadError: file could not be opened successfully

Why does this happen, and how can I solve this?

Have you restarted your kernel and tried running again?
What you are seeing isn't reproducible.

The first code block you pasted above works as written. No need to modify it.
I just ran the code below, and then calling fetch_housing_data() in another cell worked:

import os
import tarfile
import urllib
DOWNLOAD_ROOT = "https://raw.githubusercontent.com/ageron/handson-ml2/master/"
HOUSING_PATH = os.path.join("datasets", "housing")
HOUSING_URL = DOWNLOAD_ROOT + "datasets/housing/housing.tgz"
def fetch_housing_data(housing_url=HOUSING_URL, housing_path=HOUSING_PATH):
    os.makedirs(housing_path, exist_ok=True)
    tgz_path = os.path.join(housing_path, "housing.tgz")
    urllib.request.urlretrieve(housing_url, tgz_path)
    housing_tgz = tarfile.open(tgz_path)
    housing_tgz.extractall(path=housing_path)
    housing_tgz.close()

Are you sure it isn't just a display artifact, and the cell actually completes without the prompt updating?
If you want to verify independently, you can run it elsewhere like I did. I tested it by going here and pressing the launch binder link at the bottom. Then I pasted your code into a cell in the notebook that comes up. After running those two cells, I have a directory at /home/jovyan/scripts/datasets/housing containing housing.csv and housing.tgz.

https://raw.githubusercontent.com/ageron/handson-ml2/master/

I am not sure what kind of link this is. Maybe someone can explain. When I just type this link into a browser, I cannot access the page. However, it does work for retrieving the data, as I explain in the next paragraph. If I use the regular GitHub link https://github.com/ageron/handson-ml2/tree/master/ that you mentioned above, the code is unable to extract the data.
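One likely explanation for the ReadError, offered as an assumption rather than something confirmed in this thread: a github.com "tree" URL serves an HTML web page instead of the raw file bytes, so the saved housing.tgz is not really an archive. The sketch below uses two local stand-in files (the names page.tgz and real.tgz are made up for illustration) to show how tarfile.is_tarfile can tell the two cases apart before calling tarfile.open.

```python
import io
import os
import tarfile
import tempfile


def looks_like_tarball(path):
    """Return True if the tarfile module can open `path` as an archive."""
    return tarfile.is_tarfile(path)


# Two local stand-in files, so no network access is needed.
demo_dir = tempfile.mkdtemp()

# 1) A file holding HTML text, standing in for what a web page URL
#    would save to disk (assumption: the "tree" URL returns HTML).
html_path = os.path.join(demo_dir, "page.tgz")
with open(html_path, "w") as f:
    f.write("<!DOCTYPE html><html><body>not an archive</body></html>")

# 2) A real gzip-compressed tar archive containing one tiny CSV file.
real_path = os.path.join(demo_dir, "real.tgz")
csv_bytes = b"a,b\n1,2\n"
with tarfile.open(real_path, "w:gz") as tar:
    info = tarfile.TarInfo(name="housing.csv")
    info.size = len(csv_bytes)
    tar.addfile(info, io.BytesIO(csv_bytes))

html_is_tar = looks_like_tarball(html_path)
real_is_tar = looks_like_tarball(real_path)
print(html_is_tar, real_is_tar)  # False True
```

Running the same check on the housing.tgz saved by fetch_housing_data() would show whether the download produced a real archive or something else.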

I was able to extract the CSV file from the link by following the steps in the book and adding one more line to the imports: "import urllib.request". This works for me on Google Colab. You might expect "import urllib" to also import urllib.request, but that is not guaranteed: a package's submodules are only loaded when they are imported explicitly, or when some other module has already imported them, which may be why the original code works in some environments and not in others. I took the idea from the urllib documentation, which uses "import urllib.request" in one of its examples.
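Putting these points together, here is a minimal sketch of a more defensive download helper. It is not the book's code: the name fetch_archive, the 30-second default timeout, and the ValueError message are assumptions made for illustration. It imports urllib.request explicitly, passes a timeout so that a stalled connection raises an error instead of leaving the cell at In[*]: indefinitely, and checks that the downloaded file is a tar archive before extracting it.

```python
import os
import shutil
import tarfile
import urllib.request  # explicit submodule import; "import urllib" alone may not load it


def fetch_archive(url, dest_dir, filename="housing.tgz", timeout=30):
    """Download `url` into `dest_dir`, extract it, and return the directory listing.

    Hypothetical helper for illustration; not part of the book's code.
    """
    os.makedirs(dest_dir, exist_ok=True)
    tgz_path = os.path.join(dest_dir, filename)

    # urlopen accepts a timeout in seconds, unlike urlretrieve.
    with urllib.request.urlopen(url, timeout=timeout) as response, \
            open(tgz_path, "wb") as out_file:
        shutil.copyfileobj(response, out_file)

    # Fail with a clearer message if the server sent something else (e.g. HTML).
    if not tarfile.is_tarfile(tgz_path):
        raise ValueError(f"{url} did not return a tar archive")

    # The context manager closes the archive even if extraction fails.
    with tarfile.open(tgz_path) as tar:
        tar.extractall(path=dest_dir)

    return sorted(os.listdir(dest_dir))
```

With the names from the question, the call would be fetch_archive(HOUSING_URL, HOUSING_PATH), where HOUSING_URL is built from the raw.githubusercontent.com root.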
