How can I successfully download (with “fetch”) the real-world datasets of scikit-learn?

Question

I'm a beginner with scikit-learn. When I run the code to download "the 20 newsgroups text dataset" from sklearn.datasets (the code is shown at https://scikit-learn.org/stable/datasets/real_world.html )

from sklearn.datasets import fetch_20newsgroups
newsgroups_train = fetch_20newsgroups(subset='train')

the following error is returned.

OSError                                   Traceback (most recent call last)
<ipython-input-17-ade32d7dd81b> in <module>
      1 from sklearn.datasets import fetch_20newsgroups
----> 2 newsgroups_train = fetch_20newsgroups(subset='train')

~\anaconda3\lib\site-packages\sklearn\utils\validation.py in inner_f(*args, **kwargs)
     61             extra_args = len(args) - len(all_args)
     62             if extra_args <= 0:
---> 63                 return f(*args, **kwargs)
     64 
     65             # extra_args > 0

~\anaconda3\lib\site-packages\sklearn\datasets\_twenty_newsgroups.py in fetch_20newsgroups(data_home, subset, categories, shuffle, random_state, remove, download_if_missing, return_X_y)
    257             logger.info("Downloading 20news dataset. "
    258                         "This may take a few minutes.")
--> 259             cache = _download_20newsgroups(target_dir=twenty_home,
    260                                            cache_path=cache_path)
    261         else:

~\anaconda3\lib\site-packages\sklearn\datasets\_twenty_newsgroups.py in _download_20newsgroups(target_dir, cache_path)
     73 
     74     logger.info("Downloading dataset from %s (14 MB)", ARCHIVE.url)
---> 75     archive_path = _fetch_remote(ARCHIVE, dirname=target_dir)
     76 
     77     logger.debug("Decompressing %s", archive_path)

~\anaconda3\lib\site-packages\sklearn\datasets\_base.py in _fetch_remote(remote, dirname)
   1195     checksum = _sha256(file_path)
   1196 
-> 1197     if remote.checksum != checksum:
   1198         raise IOError("{} has an SHA256 checksum ({}) "
   1199                       "differing from expected ({}), "

OSError: C:\Users\owner\scikit_learn_data\20news_home\20news-bydate.tar.gz has an SHA256 checksum (cb5c6e663e59b628d9016d3cb2a3992ad38811d846c04561c3fbfa58badcb1f7) differing from expected (8f1b2514ca22a5ade8fbb9cfa5727df95fa587f4c87b786e15c759fa66d95610), file may be corrupted.

The downloaded file (C:\Users\owner\scikit_learn_data\20news_home\20news-bydate.tar.gz) is only 1 KB, whereas the real file is about 14 MB ( http://qwone.com/~jason/20Newsgroups/ ).
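A size of 1 KB suggests that something other than the archive was saved. The following is a minimal diagnostic sketch, using only the Python standard library, that reports the size of the file, whether its SHA256 digest matches the expected one from the error message, whether it starts with the gzip magic bytes, and a short text preview. The function name `inspect_download` is hypothetical and is not part of scikit-learn.

```python
import hashlib
from pathlib import Path

# Expected digest, copied from the OSError message above.
EXPECTED_SHA256 = "8f1b2514ca22a5ade8fbb9cfa5727df95fa587f4c87b786e15c759fa66d95610"


def inspect_download(path, expected_sha256=EXPECTED_SHA256):
    """Summarise a downloaded file (hypothetical helper, not a scikit-learn API)."""
    data = Path(path).read_bytes()
    return {
        # A genuine archive should be roughly 14 MB, not about 1 KB.
        "size_bytes": len(data),
        # Same comparison that scikit-learn reports as failing in the traceback.
        "sha256_matches": hashlib.sha256(data).hexdigest() == expected_sha256,
        # Every gzip stream starts with the two bytes 0x1f 0x8b.
        "looks_like_gzip": data[:2] == b"\x1f\x8b",
        # If this shows HTML, the file is probably an error or block page.
        "preview": data[:200].decode("utf-8", errors="replace"),
    }
```

Calling `inspect_download(r"C:\Users\owner\scikit_learn_data\20news_home\20news-bydate.tar.gz")` and printing the result should show whether the saved file is a real archive or a small text page.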

Why does the download fail, and how can I successfully download the file with 'fetch_20newsgroups'?

My OS is Windows 10.

Many thanks.

Answer 1

I found the reason: our company blocks Amazon websites for security reasons, so the download fails. The 20 newsgroups text dataset appears to be stored on Amazon, and the scikit-learn module fetches the data from there. The message from our company shows that 's3-eu-west-1.amazonaws.com/pfigshare-u-files' and 's3-eu-west-1.amazonaws.com/' are blocked.

Thanks to Kota Mori; your answer gave me a hint. The download URL is 'https://ndownloader.figshare.com/files/5975967'. If I paste it into a web browser, the address changes to 'https://s3-eu-west-1.amazonaws.com/pfigshare-u-files/5975967/20newsbydate.tar.gz?...' and the company's block page appears.
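The same redirect check can be sketched in Python instead of a browser. This is only an illustration using the standard library: `final_url` and `is_blocked` are hypothetical helper names, and the prefix list simply repeats the two entries from the company message. On a network that blocks the target, `final_url` may raise an error or return the address of the block page, depending on how the proxy behaves.

```python
from urllib.parse import urlsplit
from urllib.request import urlopen

# The two entries reported as blocked in the company message.
BLOCKED_PREFIXES = (
    "s3-eu-west-1.amazonaws.com/pfigshare-u-files",
    "s3-eu-west-1.amazonaws.com/",
)


def final_url(url, timeout=10):
    """Follow redirects and return the URL that finally answered (needs network)."""
    with urlopen(url, timeout=timeout) as response:
        return response.geturl()


def is_blocked(url, blocked_prefixes=BLOCKED_PREFIXES):
    """Return True if the host and path of `url` start with a blocked prefix."""
    parts = urlsplit(url)
    # Treat an empty path as "/" so a bare host still matches "host/".
    host_and_path = parts.netloc.lower() + (parts.path or "/")
    return any(host_and_path.startswith(prefix) for prefix in blocked_prefixes)
```

For example, `is_blocked(final_url("https://ndownloader.figshare.com/files/5975967"))` would report whether the address that figshare redirects to falls under one of the blocked prefixes.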

solution1
0 2021-07-26 06:53:38
