Saving the web pages by reading the list of urls from file

I could only think of one way of solving this problem, but it has the limitations listed below. Can somebody suggest another way of solving it?

We are given a text file with 999999 URLs. We have to write a Python program that reads this file and saves all the webpages in a folder called 'saved_page'.

I have tried to solve this problem with something like this:

import os
import urllib

save_path = 'C:/test/home_page/'
Name = os.path.join(save_path, "test.txt")

# all the urls are in the soop.txt file, one per line
file = open('soop.txt', 'r')
for line in file:
    # strip the trailing newline before fetching the URL
    data = urllib.urlopen(line.strip())
    # append the downloaded page to the single output file
    f = open(Name, "a")
    f.write(data.read())
    f.close()
file.close()

Here are some limitations of this code:

1) If the network goes down, this code will have to restart from the beginning.

2) If it comes across a bad URL (i.e. the server doesn't respond), this code will get stuck.

3) I am currently downloading in sequence, which will be quite slow for a large number of URLs.

So can somebody suggest a solution that addresses these problems as well?

Some remarks:

Points 1 and 2 can easily be fixed by a restart-point method. For an in-script restart, just loop until everything is OK or a maximum number of attempts is reached, under the for line in file line containing the read part, and only write if you could successfully download the file. You will still have to decide what to do in case of a file that cannot be downloaded: either log an error and continue with the next file, or abort the whole job.
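A minimal sketch of that retry idea, assuming Python 3's urllib.request (the question's urllib.urlopen is the Python 2 equivalent); the MAX_ATTEMPTS constant and fetch() helper are illustrative names, not part of the original question:

import urllib.request

MAX_ATTEMPTS = 3

def fetch(url):
    """Return the page body as bytes, or None if every attempt fails."""
    for attempt in range(MAX_ATTEMPTS):
        try:
            # the timeout keeps a silent server from blocking forever (point 2)
            with urllib.request.urlopen(url, timeout=10) as response:
                return response.read()
        except OSError:  # urllib.error.URLError is a subclass of OSError
            pass  # network hiccup or bad URL: try again (point 1)
    return None  # caller decides: log and continue, or abort the whole job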

If you want to be able to restart a failed job later, you should keep somewhere (a state.txt file) the list of successfully downloaded files. You write (and flush) after each file is fetched and written. But to be really bulletproof, you should write one element after getting the file, and another after successfully writing it. That way, on restart, you can tell whether the output file may contain a partially written file (power outage, interruption, ...) simply by testing for the presence of the state file and checking its content.
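A rough sketch of such a state file; the 'GOT'/'WROTE' markers, file names and helper functions are assumptions chosen for illustration, not prescribed by the answer:

import os

STATE_FILE = 'state.txt'

def load_done(path=STATE_FILE):
    """Return the set of URLs that were both fetched and fully written."""
    if not os.path.exists(path):
        return set()
    with open(path) as f:
        lines = [l.rstrip('\n') for l in f]
    got = {l[4:] for l in lines if l.startswith('GOT ')}
    wrote = {l[6:] for l in lines if l.startswith('WROTE ')}
    # only URLs with both markers are known to be completely written
    return got & wrote

def record(state, marker, url):
    """Append one marker line and flush it so a crash cannot lose it."""
    state.write('%s %s\n' % (marker, url))
    state.flush()
    os.fsync(state.fileno())

# usage sketch:
# done = load_done()
# with open(STATE_FILE, 'a') as state:
#     for url in urls:
#         if url in done:
#             continue                  # already handled on a previous run
#         body = fetch(url)             # hypothetical downloader with retries
#         record(state, 'GOT', url)
#         write_to_output(url, body)    # hypothetical writer
#         record(state, 'WROTE', url)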

Point 3 would be much trickier. To allow parallel downloads, you will have to use threads or asyncio. But you will also have to synchronize all of that to ensure the files are written to the output file in the proper order. If you can afford to keep everything in memory, a simple way would be to first download everything using a parallelized method (the link given by J.F. Sebastian can help), and then write in order.
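If memory is not a concern, a sketch of that download-in-parallel-then-write-in-order approach could look like the following, assuming Python 3's concurrent.futures and reusing the hypothetical fetch() helper from the earlier sketch; executor.map returns results in the order of its input, which takes care of the ordering:

from concurrent.futures import ThreadPoolExecutor

def download_all(urls, output_path, workers=20):
    """Download all URLs in parallel, then write the bodies in input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() yields results in the same order as the input URL list,
        # even though the downloads themselves run concurrently
        results = list(pool.map(fetch, urls))
    with open(output_path, 'wb') as out:
        for body in results:
            if body is not None:  # skip URLs that failed after all retries
                out.write(body)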
