Progress bar for downloading a large CSV file from the Internet with Python

Question

I am reading McKinney's data analysis book, and he has shared a 150MB file. Although this topic has been discussed extensively in "Progress Bar while download file over http with Requests", the code in the accepted answer throws an error for me. I am a beginner, so I am unable to resolve this.

I want to download the following file:

https://raw.githubusercontent.com/wesm/pydata-book/2nd-edition/datasets/fec/P00000001-ALL.csv

Here's the code without a progress bar:

DATA_PATH='./Data'
filename = "P00000001-ALL.csv"
url_without_filename = "https://raw.githubusercontent.com/wesm/pydata-book/2nd-edition/datasets/fec"

url_with_filename = url_without_filename + "/" + filename
local_filename = DATA_PATH + '/' + filename

#Write the file on local disk
r = requests.get(url_with_filename)  #without streaming
with open(local_filename, 'w', encoding=r.encoding) as f:
    f.write(r.text)

This works well, but because there is no progress bar, I can't tell what is going on while it runs.

Here's the code adapted from "Progress Bar while download file over http with Requests" and "How to download large file in python with requests.py?":

#Option 2:
#Write the file on local disk
r = requests.get(url_with_filename, stream=True)  # added stream parameter
total_size = int(r.headers.get('content-length', 0))

with open(local_filename, 'w', encoding=r.encoding) as f:
    #f.write(r.text)
    for chunk in tqdm(r.iter_content(1024), total=total_size, unit='B', unit_scale=True):
        if chunk:
            f.write(chunk)

There are two problems with the second option (i.e. with streaming and the tqdm package):

a) The file size isn't calculated correctly. The actual size is 157MB, but total_size turns out to be 25MB.

b) An even bigger problem than a) is that I get the following error:

  0%|          | 0.00/24.6M [00:00<?, ?B/s]
Traceback (most recent call last):
  File "C:\Anaconda3\lib\site-packages\IPython\core\interactiveshell.py", line 3265, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-31-abbe9270092b>", line 6, in <module>
    f.write(data)
TypeError: write() argument must be str, not bytes

As a beginner, I am unsure how to solve these two issues. I spent a lot of time going through the tqdm git page, but I couldn't follow it. I'd appreciate any help.

I am assuming that readers know that requests and tqdm need to be imported, so I haven't included the code for importing these basic packages.

Here's the code for those who are curious:

import sys

local_filename = DATA_PATH + '/' + filename
with open(local_filename, 'wb') as f:  # binary mode, so bytes chunks can be written
    r = requests.get(url_with_filename, stream=True)  # added stream parameter
    # total_size = int(r.headers.get('content-length', 0))
    # Note: r.content reads the whole body into memory before the loop starts,
    # so the bar below tracks writing to disk rather than the download itself.
    total_size = len(r.content)
    downloaded = 0
    # chunk_size = max(1024*1024,int(total_size/1000))
    chunk_size = 1024
    #for chunk in tqdm(r.iter_content(chunk_size=chunk_size),total=total_size,unit='KB',unit_scale=True):
    for chunk in r.iter_content(chunk_size=chunk_size):
        downloaded += len(chunk)
        f.write(chunk)
        done = int(50 * downloaded / total_size)
        # Redraw a 50-character text bar on the same console line
        sys.stdout.write("\r[%s%s]" % ('=' * done, ' ' * (50 - done)))
        sys.stdout.flush()
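
A likely reason the Content-Length header reported about 25MB for a 157MB file is that the server sent the body gzip-compressed. This is an assumption, since the response headers are not shown here. In that case Content-Length counts the compressed bytes on the wire, while iter_content and r.content return the decompressed data. The sketch below reproduces the effect offline with the standard gzip module; the sample row is made up for illustration and is not taken from the real file.

```python
import gzip

# Hypothetical CSV-like payload; repetitive text compresses well.
row = b"1,Jane Doe,AL,250.0\n"
body = row * 10_000

# What a server using "Content-Encoding: gzip" would put on the wire.
wire_bytes = gzip.compress(body)

content_length = len(wire_bytes)  # what the Content-Length header would report
decoded_length = len(body)        # what the client ends up writing to disk

print(f"on the wire: {content_length} bytes, decoded: {decoded_length} bytes")
```

If that is the cause, a progress bar whose total comes from Content-Length but which is advanced by decoded chunk sizes will overshoot its total. Using len(r.content) gives the right total only because it downloads the whole body first.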

Answer 1

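As the error says:

write() argument must be str, not bytes

so convert each chunk to a string by decoding it (str(chunk) would write the bytes repr, b'...', rather than the text):

f.write(chunk.decode(r.encoding or 'utf-8'))

Note: Decoding chunk by chunk can fail when a multi-byte character is split across two chunks. Instead, I would suggest writing the raw bytes to a file opened in binary mode; the bytes on disk are already the .csv content.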
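
The difference between str() and decode() on a bytes chunk can be checked with a small standard-library sketch. It shows that str() on a bytes object yields its repr (including the b'...' wrapper), that decoding a chunk on its own can fail when a multi-byte character is split, and that an incremental decoder handles the split. The sample chunks are made up for illustration.

```python
import codecs

chunk = b"name,amount\n"

# str() on bytes returns the repr, not the decoded text.
as_repr = str(chunk)
as_text = chunk.decode("utf-8")
print(as_repr)
print(as_text)

# A multi-byte character ("é" is two bytes in UTF-8) split across two chunks.
data = "café,1\n".encode("utf-8")
chunks = [data[:4], data[4:]]  # the split falls inside "é"

# Decoding the first chunk on its own fails.
try:
    chunks[0].decode("utf-8")
    naive_decode_failed = False
except UnicodeDecodeError:
    naive_decode_failed = True

# An incremental decoder buffers the incomplete byte until the next chunk.
decoder = codecs.getincrementaldecoder("utf-8")()
text = "".join(decoder.decode(c) for c in chunks)
text += decoder.decode(b"", final=True)
print(text)
```

Writing the bytes unchanged to a file opened in binary mode avoids the decoding step altogether.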

Answer 2

Try opening the file with 'wb' instead of just 'w', and write the raw bytes (r.content) rather than r.text. Binary mode does not take an encoding argument:

with open(local_filename, 'wb') as f:
    f.write(r.content)
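
How the file mode and the data type must match can be checked with the standard-library sketch below. It writes to a temporary file (a hypothetical stand-in for the download target): text mode accepts str and rejects bytes, binary mode accepts bytes, and open() rejects binary mode combined with an encoding argument.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "demo.csv")

# Text mode: str is fine, bytes raises TypeError (the error from the question).
with open(path, "w", encoding="utf-8") as f:
    f.write("a,b\n")
    try:
        f.write(b"1,2\n")
    except TypeError as exc:
        text_mode_error = str(exc)

# Binary mode: bytes are written as-is.
with open(path, "wb") as f:
    f.write(b"a,b\n1,2\n")

# Binary mode combined with encoding= is rejected by open().
try:
    open(path, "wb", encoding="utf-8")
except ValueError as exc:
    binary_mode_error = str(exc)

print(text_mode_error)
print(binary_mode_error)
```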

Answer 3

with open(filename, 'wb') as f:
    f.write(r.content)

This should fix your writing problem. Write r.content, not r.text: type(r.content) is <class 'bytes'>, which is what a file opened in 'wb' mode expects. Note that binary mode does not accept an encoding argument.
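
For the streaming case with a progress bar, one possible sketch follows. It is not the code from the linked answers, and the helper names save_stream and download are made up for illustration. The bar is advanced by the number of bytes in each chunk; wrapping the iterator directly in tqdm counts one step per chunk, which does not match a total given in bytes. The total is read from Content-Length, which may be the compressed size or missing altogether, so treat it as approximate.

```python
import io


def save_stream(chunks, out_file, on_progress=None):
    """Write an iterable of bytes chunks to a binary file object.

    Calls on_progress(n) with the size of each chunk written and
    returns the total number of bytes written.
    """
    written = 0
    for chunk in chunks:
        if not chunk:  # skip empty chunks
            continue
        out_file.write(chunk)
        written += len(chunk)
        if on_progress is not None:
            on_progress(len(chunk))
    return written


def download(url, local_filename, chunk_size=1024 * 1024):
    """Stream url to local_filename while showing a tqdm progress bar."""
    # Imported here so save_stream() can be tried without these packages.
    import requests
    from tqdm import tqdm

    with requests.get(url, stream=True) as r:
        r.raise_for_status()
        # May be the compressed size, or absent (then the bar has no total).
        total = int(r.headers.get("content-length", 0)) or None
        with open(local_filename, "wb") as f, tqdm(
            total=total, unit="B", unit_scale=True
        ) as bar:
            return save_stream(r.iter_content(chunk_size), f, bar.update)


# Offline check with in-memory data (no network needed).
buffer = io.BytesIO()
sizes = []
total_written = save_stream([b"a,b\n", b"", b"1,2\n"], buffer, sizes.append)
```

With the variables from the question, the call would be download(url_with_filename, local_filename).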

Answer 1
1 2018-10-12 07:33:08

Answer 2
0 2018-10-12 07:22:18

Answer 3
0 accepted 2018-10-12 08:13:57
