
File downloading using python with threads

I'm creating a Python script which accepts a path to a remote file and a number of threads n. The file's size will be divided by the number of threads; when each thread completes, I want it to append the fetched data to a local file.

How do I manage it so that the threads append to the local file in the order they were generated, so that the bytes don't get scrambled?

Also, what if I want to download several files simultaneously?

You could coordinate the work with locks &c, but I recommend instead using Queue -- usually the best way to coordinate multi-threading (and multi-processing) in Python.

I would have the main thread spawn as many worker threads as you think appropriate (you may want to calibrate between performance and load on the remote server by experimenting); every worker thread waits at the same global Queue.Queue instance, call it workQ for example, for "work requests" (wr = workQ.get() will do it properly -- each work request is obtained by a single worker thread, no fuss, no muss).

A "work request" can in this case simply be a triple (tuple with three items): identification of the remote file (URL or whatever), the offset from which to get data, and the number of bytes to get (note that this works just as well for one or multiple files to fetch).

The main thread pushes all work requests to workQ (just workQ.put((url, offset, numbytes)) for each request) and waits for results to arrive on another Queue instance, call it resultQ (each result will also be a triple: identifier of the file, starting offset, and the string of bytes fetched from that file at that offset).

As each worker thread satisfies the request it's handling, it puts the results into resultQ and goes back to fetch another work request (or wait for one). Meanwhile the main thread (or a separate dedicated "writing thread" if needed -- i.e. if the main thread has other work to do, for example on the GUI) gets results from resultQ and performs the needed open, seek, and write operations to place the data at the right spot.

There are several ways to terminate the operation: for example, a special work request may ask the thread receiving it to terminate -- the main thread puts as many of those on workQ as there are worker threads, after all the actual work requests, then joins all the worker threads when all data have been received and written (many alternatives exist, such as joining the queue directly, or making the worker threads daemonic so they just go away when the main thread terminates, and so forth).
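Putting those pieces together, a minimal sketch of that scheme might look like the following, assuming Python 3 (where Queue.Queue is queue.Queue), a server that honors HTTP Range requests, and a known total size; the chunk size, worker count, and function names are illustrative, not part of the original answer:

import threading
import queue
import urllib.request

CHUNK = 256 * 1024          # assumed chunk size, tune by experiment
NUM_WORKERS = 4             # assumed worker count, tune by experiment

workQ = queue.Queue()       # work requests: (url, offset, numbytes)
resultQ = queue.Queue()     # results: (url, offset, data)

def worker():
    while True:
        wr = workQ.get()
        if wr is None:      # special "terminate" work request
            break
        url, offset, numbytes = wr
        # Ask the server for just this byte range (requires Range support).
        req = urllib.request.Request(
            url, headers={"Range": "bytes=%d-%d" % (offset, offset + numbytes - 1)})
        data = urllib.request.urlopen(req).read()
        resultQ.put((url, offset, data))

def download(url, total_size, outpath):
    threads = [threading.Thread(target=worker) for _ in range(NUM_WORKERS)]
    for t in threads:
        t.start()
    # Push all work requests, then one sentinel per worker thread.
    nrequests = 0
    for offset in range(0, total_size, CHUNK):
        workQ.put((url, offset, min(CHUNK, total_size - offset)))
        nrequests += 1
    for _ in threads:
        workQ.put(None)
    # The main thread plays the "writing thread": open, seek, write.
    with open(outpath, "wb") as f:
        for _ in range(nrequests):
            _, offset, data = resultQ.get()
            f.seek(offset)
            f.write(data)
    for t in threads:
        t.join()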

You need to fetch completely separate parts of the file on each thread. Calculate the chunk start and end positions based on the number of threads. Obviously, the chunks must not overlap.

For example, if the target file is 3000 bytes long and you want to fetch it using three threads:

  • Thread 1: fetches bytes 1 to 1000
  • Thread 2: fetches bytes 1001 to 2000
  • Thread 3: fetches bytes 2001 to 3000

You would pre-allocate an empty file of the original size, and write back to the respective positions within the file, as in the sketch below.
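Here is a sketch of that idea, assuming the server supports HTTP Range requests and the file size is known in advance (note that HTTP byte ranges are zero-based, so the three chunks above become 0-999, 1000-1999, and 2000-2999); fetch_chunk and download are hypothetical helper names:

import threading
import urllib.request

def fetch_chunk(url, path, start, end):
    # Fetch bytes start..end (inclusive) and write them at the same offset.
    req = urllib.request.Request(url, headers={"Range": "bytes=%d-%d" % (start, end)})
    data = urllib.request.urlopen(req).read()
    # Each thread opens its own handle, so seek positions don't interfere.
    with open(path, "r+b") as f:
        f.seek(start)
        f.write(data)

def download(url, path, size, nthreads=3):
    # Pre-allocate an empty file of the full size.
    with open(path, "wb") as f:
        f.truncate(size)
    chunk = size // nthreads
    threads = []
    for i in range(nthreads):
        start = i * chunk
        # The last thread also takes any remainder bytes.
        end = size - 1 if i == nthreads - 1 else start + chunk - 1
        t = threading.Thread(target=fetch_chunk, args=(url, path, start, end))
        t.start()
        threads.append(t)
    for t in threads:
        t.join()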

You can use a thread-safe "semaphore", like this:

import threading

class Counter:
  counter = 0
  lock = threading.Lock()
  @classmethod
  def inc(cls):
    with cls.lock:  # a bare increment is not atomic across threads
      cls.counter += 1
      return cls.counter

Using Counter.inc() returns an incremented number across threads, which you can use to keep track of the current block of bytes.
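For example, each thread could call it to claim the next block of bytes (the block size and next_block helper here are illustrative, not from the original answer):

BLOCK_SIZE = 1024 * 1024  # illustrative block size

def next_block():
    n = Counter.inc()                # 1, 2, 3, ... across all threads
    offset = (n - 1) * BLOCK_SIZE    # block n starts at this byte offset
    return offset, BLOCK_SIZE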

That being said, there's no need to split a file download across several threads, because the download stream is far slower than writing to disk, so one write will always finish before the next chunk has been downloaded.

The best and least resource-hungry way is simply to have a download file descriptor linked directly to a file object on disk.
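A minimal sketch of that single-threaded approach, streaming the response straight to disk (the URL and filename are placeholders):

import shutil
import urllib.request

# Stream the response into the file without loading it all into memory.
url = "http://example.com/bigfile.bin"   # placeholder URL
with urllib.request.urlopen(url) as response, open("bigfile.bin", "wb") as out:
    shutil.copyfileobj(response, out)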

For "download several files simultaneously", I recommend this article: Practical threaded programming with Python. It provides a related simultaneous-download example by combining threads with Queues; I think it's worth reading.
