I'm new to scikit-learn. When I run this code to download the 20 newsgroups text dataset from sklearn.datasets (the code is from https://scikit-learn.org/stable/datasets/real_world.html ),
from sklearn.datasets import fetch_20newsgroups
newsgroups_train = fetch_20newsgroups(subset='train')
the following error is returned.
OSError Traceback (most recent call last)
<ipython-input-17-ade32d7dd81b> in <module>
1 from sklearn.datasets import fetch_20newsgroups
----> 2 newsgroups_train = fetch_20newsgroups(subset='train')
~\anaconda3\lib\site-packages\sklearn\utils\validation.py in inner_f(*args, **kwargs)
61 extra_args = len(args) - len(all_args)
62 if extra_args <= 0:
---> 63 return f(*args, **kwargs)
64
65 # extra_args > 0
~\anaconda3\lib\site-packages\sklearn\datasets\_twenty_newsgroups.py in fetch_20newsgroups(data_home, subset, categories, shuffle, random_state, remove, download_if_missing, return_X_y)
257 logger.info("Downloading 20news dataset. "
258 "This may take a few minutes.")
--> 259 cache = _download_20newsgroups(target_dir=twenty_home,
260 cache_path=cache_path)
261 else:
~\anaconda3\lib\site-packages\sklearn\datasets\_twenty_newsgroups.py in _download_20newsgroups(target_dir, cache_path)
73
74 logger.info("Downloading dataset from %s (14 MB)", ARCHIVE.url)
---> 75 archive_path = _fetch_remote(ARCHIVE, dirname=target_dir)
76
77 logger.debug("Decompressing %s", archive_path)
~\anaconda3\lib\site-packages\sklearn\datasets\_base.py in _fetch_remote(remote, dirname)
1195 checksum = _sha256(file_path)
1196
-> 1197 if remote.checksum != checksum:
1198 raise IOError("{} has an SHA256 checksum ({}) "
1199 "differing from expected ({}), "
OSError: C:\Users\owner\scikit_learn_data\20news_home\20news-bydate.tar.gz has an SHA256 checksum (cb5c6e663e59b628d9016d3cb2a3992ad38811d846c04561c3fbfa58badcb1f7) differing from expected (8f1b2514ca22a5ade8fbb9cfa5727df95fa587f4c87b786e15c759fa66d95610), file may be corrupted.
The downloaded file (C:\Users\owner\scikit_learn_data\20news_home\20news-bydate.tar.gz) is only 1 KB, but the real archive is about 14 MB (see http://qwone.com/~jason/20Newsgroups/ ).
Why does the download fail, and how can I successfully download the file with fetch_20newsgroups?
My OS is Windows 10.
Many thanks.
I found the reason. Our company blocks Amazon-hosted sites for security reasons, so the download fails. The 20 newsgroups text dataset appears to be stored on Amazon S3, and scikit-learn fetches it from there. The block message from our company shows that 's3-eu-west-1.amazonaws.com/pfigshare-u-files' and 's3-eu-west-1.amazonaws.com/' are blocked.
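One way to confirm that the 1 KB file is a block page rather than the real archive is to look at it directly. The sketch below is only an illustration: the helper names sha256_of and inspect_download are made up here, the expected digest is copied from the error message in the question, and the file path assumes the default scikit_learn_data location in the user's home directory.

```python
import hashlib
from pathlib import Path

# Expected digest, copied from the OSError message in the question.
EXPECTED_SHA256 = "8f1b2514ca22a5ade8fbb9cfa5727df95fa587f4c87b786e15c759fa66d95610"


def sha256_of(path):
    """Return the hex SHA256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(8192), b""):
            digest.update(chunk)
    return digest.hexdigest()


def inspect_download(path):
    """Collect basic facts about a downloaded file (hypothetical helper)."""
    path = Path(path)
    with open(path, "rb") as fh:
        head = fh.read(200)
    return {
        "size_bytes": path.stat().st_size,
        # Every gzip file starts with the two magic bytes 0x1f 0x8b.
        "is_gzip": head[:2] == b"\x1f\x8b",
        "sha256_ok": sha256_of(path) == EXPECTED_SHA256,
        # The first bytes of a proxy block page are usually readable HTML.
        "head": head,
    }


if __name__ == "__main__":
    # Assumed default cache location; adjust if data_home was changed.
    archive = Path.home() / "scikit_learn_data" / "20news_home" / "20news-bydate.tar.gz"
    if archive.exists():
        print(inspect_download(archive))
```

If is_gzip is False and head contains HTML, the proxy most likely returned its block page in place of the archive, which would explain both the tiny size and the checksum mismatch.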
Thanks to Kota Mori, whose answer gave me a hint. The download URL is 'https://ndownloader.figshare.com/files/5975967'; if I paste it into a web browser, the address changes to 'https://s3-eu-west-1.amazonaws.com/pfigshare-u-files/5975967/20newsbydate.tar.gz?...' and the company's block page is shown.
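If the block cannot be lifted, a possible workaround is to obtain 20news-bydate.tar.gz some other way (for example from the qwone.com page mentioned in the question, or on an unrestricted network) and load it without fetch_20newsgroups. The following is a minimal sketch, not the library's own download path: extract_archive and load_split are names invented for this example, the archive location is hypothetical, and it assumes the archive unpacks into 20news-bydate-train and 20news-bydate-test folders with one subfolder per newsgroup. It relies on sklearn.datasets.load_files, which reads such a folder layout.

```python
import tarfile
from pathlib import Path


def extract_archive(archive_path, extract_dir):
    """Unpack a .tar.gz archive and return the extraction directory."""
    extract_dir = Path(extract_dir)
    extract_dir.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as tar:
        if hasattr(tarfile, "data_filter"):
            # Newer Python versions support safer extraction filters.
            tar.extractall(path=extract_dir, filter="data")
        else:
            tar.extractall(path=extract_dir)
    return extract_dir


def load_split(extract_dir, subset="train"):
    """Load one split ('train' or 'test') from the extracted folders."""
    # Imported here so the extraction helper works without scikit-learn.
    from sklearn.datasets import load_files

    split_dir = Path(extract_dir) / f"20news-bydate-{subset}"
    # The posts are not valid UTF-8 throughout, so decode as latin1.
    return load_files(str(split_dir), encoding="latin1")


if __name__ == "__main__":
    # Hypothetical location of the manually obtained archive.
    archive = Path("20news-bydate.tar.gz")
    if archive.exists():
        data_dir = extract_archive(archive, "20news_extracted")
        newsgroups_train = load_split(data_dir, "train")
        print(len(newsgroups_train.data), newsgroups_train.target_names[:3])
```

The returned object has data, target and target_names attributes like the one from fetch_20newsgroups, but the document order may differ because the two functions use different default shuffling seeds, and options such as remove are not available here.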