Progress bar for downloading large CSV files from the Internet using Python
I am reading McKinney's Data Analysis book, and he has shared a 150MB file. Although this topic has been discussed extensively in "Progress Bar while download file over http with Requests", I am finding that the code in the accepted answer throws an error. I am a beginner, so I am unable to resolve this.
I want to download the following file:
https://raw.githubusercontent.com/wesm/pydata-book/2nd-edition/datasets/fec/P00000001-ALL.csv
Here's the code without a progress bar:
DATA_PATH = './Data'
filename = "P00000001-ALL.csv"
url_without_filename = "https://raw.githubusercontent.com/wesm/pydata-book/2nd-edition/datasets/fec"
url_with_filename = url_without_filename + "/" + filename
local_filename = DATA_PATH + '/' + filename
# Write the file on local disk
r = requests.get(url_with_filename)  # without streaming
with open(local_filename, 'w', encoding=r.encoding) as f:
    f.write(r.text)
This works well, but because there is no progress bar, I can't tell what is going on while it runs.
Here's the code adapted from "Progress Bar while download file over http with Requests" and "How to download large file in python with requests.py?":
# Option 2:
# Write the file on local disk
r = requests.get(url_with_filename, stream=True)  # added stream parameter
total_size = int(r.headers.get('content-length', 0))
with open(local_filename, 'w', encoding=r.encoding) as f:
    # f.write(r.text)
    for chunk in tqdm(r.iter_content(1024), total=total_size, unit='B', unit_scale=True):
        if chunk:
            f.write(chunk)
There are two problems with the second option (i.e. with streaming and the tqdm package):
a) The file size isn't calculated correctly. The actual size is 157MB, but total_size turns out to be 25MB.
b) An even bigger problem than a) is that I get the following error:
0%| | 0.00/24.6M [00:00<?, ?B/s]
Traceback (most recent call last):
  File "C:\Anaconda3\lib\site-packages\IPython\core\interactiveshell.py", line 3265, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-31-abbe9270092b>", line 6, in <module>
    f.write(data)
TypeError: write() argument must be str, not bytes
As a beginner, I am unsure how to solve these two issues. I spent a lot of time going through the git page of tqdm, but I couldn't follow it. I'd appreciate any help.
I am assuming that readers know that we need to import requests and tqdm, so I haven't included the code for importing these basic packages.
Here's the code for those who are curious:
import sys

local_filename = DATA_PATH + '/' + filename
with open(local_filename, 'wb') as f:
    r = requests.get(url_with_filename, stream=True)  # added stream parameter
    # total_size = int(r.headers.get('content-length', 0))
    total_size = len(r.content)  # note: this reads the whole body into memory first
    downloaded = 0
    # chunk_size = max(1024*1024, int(total_size/1000))
    chunk_size = 1024
    # for chunk in tqdm(r.iter_content(chunk_size=chunk_size), total=total_size, unit='KB', unit_scale=True):
    for chunk in r.iter_content(chunk_size=chunk_size):
        downloaded += len(chunk)
        f.write(chunk)
        done = int(50 * downloaded / total_size)
        sys.stdout.write("\r[%s%s]" % ('=' * done, ' ' * (50 - done)))
        sys.stdout.flush()
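One observation on the code above: calling len(r.content) makes requests read the entire body into memory before the loop starts, so the bar only tracks the write to disk, not the download. Below is a minimal sketch of a streaming variant. The helper names render_bar and write_chunks are made up for illustration and are not part of requests or tqdm. The bar is clamped on the assumption that the Content-Length header may report the compressed transfer size (which would explain a 25MB total for a 157MB file), while iter_content yields decompressed bytes; the response headers are not shown in the question, so treat that as a likely explanation rather than a confirmed one.

```python
import io
import sys


def render_bar(downloaded, total, width=50):
    """Return a text bar such as '[=====     ]' for the given byte counts."""
    if total <= 0:
        # Unknown size: show only the running byte count.
        return "[%d bytes]" % downloaded
    # Clamp so the bar never overflows if more bytes arrive than announced.
    done = min(width, int(width * downloaded / total))
    return "[%s%s]" % ("=" * done, " " * (width - done))


def write_chunks(chunks, out, total, show=sys.stdout):
    """Write each bytes chunk to `out`, redraw the bar, return bytes written."""
    downloaded = 0
    for chunk in chunks:
        if not chunk:
            continue
        out.write(chunk)
        downloaded += len(chunk)
        show.write("\r" + render_bar(downloaded, total))
        show.flush()
    show.write("\n")
    return downloaded


# Local demonstration with fake chunks instead of a network response.
buffer = io.BytesIO()
written = write_chunks([b"a,b\n", b"1,2\n"], buffer, total=8)

# With requests, the call might look like this (not run here):
# r = requests.get(url_with_filename, stream=True)
# total = int(r.headers.get("content-length", 0))
# with open(local_filename, "wb") as f:
#     write_chunks(r.iter_content(chunk_size=1024), f, total)
```

The same idea carries over to tqdm: instead of wrapping the iterator, which counts one step per chunk, create the bar with total=total_size and call its update(len(chunk)) method inside the loop so that it counts bytes.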
As the error says:
write() argument must be str, not bytes
so each chunk has to be converted to a string before it is written to a file opened in text mode. Note that str(chunk) returns the repr of the bytes object (b'...') rather than the text, so decode it instead:
f.write(chunk.decode(r.encoding or 'utf-8'))
Note: Instead, I would suggest writing to a .bin file and then converting it to .csv
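To make the conversion concrete, here is a small sketch of what str() and decode() each do with a bytes chunk. The sample bytes and the choice of UTF-8 are assumptions for illustration; the real encoding would come from r.encoding. The second half uses an incremental decoder from the standard library, because a fixed chunk size can split a multi-byte character across two chunks, and decoding each chunk on its own would then fail.

```python
import codecs

chunk = b"a,b\n"

as_repr = str(chunk)             # the repr of the bytes object, quotes and all
as_text = chunk.decode("utf-8")  # the decoded text

# An incremental decoder copes with a character split across two chunks.
decoder = codecs.getincrementaldecoder("utf-8")()
pieces = [decoder.decode(part) for part in (b"caf\xc3", b"\xa9\n")]
pieces.append(decoder.decode(b"", final=True))  # flush any buffered bytes
text = "".join(pieces)
```

Writing the raw bytes to a file opened in binary mode avoids the decoding step entirely, which is why that route is usually simpler for a plain download.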
Try writing with wb instead of just w. Binary mode takes no encoding argument, and the bytes to write are in r.content rather than r.text:
with open(local_filename, 'wb') as f:
    f.write(r.content)
with open(filename, 'wb') as f:
    f.write(r.content)
This should fix your writing problem. Write r.content, not r.text, since type(r.content) is <class 'bytes'>, which is what you need to write to a file opened in binary mode.
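The difference between the two file modes can be checked locally without downloading anything. The sketch below uses a short made-up payload in place of r.content and temporary file names chosen for illustration. It shows that binary mode accepts bytes, that combining a binary mode with encoding= raises ValueError, and that text mode rejects bytes with the TypeError from the question.

```python
import os
import tempfile

payload = "name,amount\n".encode("utf-8")  # stands in for r.content
folder = tempfile.mkdtemp()
path = os.path.join(folder, "sample.csv")

# Binary mode accepts bytes and takes no encoding argument.
with open(path, "wb") as f:
    f.write(payload)
with open(path, "rb") as f:
    saved = f.read()

# Passing encoding= together with a binary mode raises ValueError.
try:
    open(path, "wb", encoding="utf-8")
    binary_with_encoding_ok = True
except ValueError:
    binary_with_encoding_ok = False

# Text mode rejects bytes with the TypeError from the question.
try:
    with open(os.path.join(folder, "text.csv"), "w", encoding="utf-8") as f:
        f.write(payload)
    text_mode_accepts_bytes = True
except TypeError:
    text_mode_accepts_bytes = False
```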