
Saving web pages by reading a list of URLs from a file

I could think of only one way of solving this problem, but it has the limitations listed below. Can somebody suggest another way of solving it?

We are given a text file with 999999 URLs. We have to write a Python program that reads this file and saves all the web pages in a folder called 'saved_page'.

I have tried to solve this problem like this:

import os
import urllib.request   # Python 3; the original urllib.urlopen only exists in Python 2

save_path = 'C:/test/home_page/'
name = os.path.join(save_path, "test.txt")

# all the urls are in the soop.txt file
with open('soop.txt', 'r') as url_file, open(name, 'wb') as out:
    for line in url_file:
        data = urllib.request.urlopen(line.strip())
        out.write(data.read())   # write each page into the single output file

Here are some limitations of this code:

1) If the network goes down, this code has to restart from the beginning.

2) If it comes across a bad URL, i.e. a server that doesn't respond, this code will get stuck.

3) I am currently downloading in sequence, which will be quite slow for a large number of URLs.

So can somebody suggest a solution that would address these problems as well?

Some remarks:

Points 1 and 2 can easily be fixed with a restart-point method. For an in-script retry, just loop until the download succeeds or a maximum number of attempts is reached, inside the "for line in file" loop that does the reading, and only write if you could successfully download the file (see the sketch below). You will still have to decide what to do with a file that cannot be downloaded: either log an error and continue with the next file, or abort the whole job.
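For example, a minimal retry sketch in Python 3 (urllib.request is assumed; the fetch() name, the number of attempts, and the 10-second timeout are arbitrary choices for illustration):

import urllib.error
import urllib.request

MAX_ATTEMPTS = 3   # arbitrary choice for illustration

def fetch(url):
    # Return the page body as bytes, or None if every attempt fails.
    for attempt in range(MAX_ATTEMPTS):
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                return resp.read()
        except (urllib.error.URLError, OSError) as exc:
            print("attempt %d failed for %s: %s" % (attempt + 1, url, exc))
    return None   # caller decides: log the error and continue, or abort the whole job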

If you want to be able to restart a failed job later, you should keep the list of successfully downloaded files somewhere (a state.txt file). Write to it (and flush) after each file is fetched and written. But to be really bullet-proof, you should write one entry after getting the file and one entry after successfully writing it. That way, on restart, you can tell whether the output may contain a partially written file (power outage, interruption, ...) simply by checking the presence of the state file and its content. A sketch follows.
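A minimal sketch of that idea, reusing the hypothetical fetch() above; save_page() is an assumed helper that writes one page into the 'saved_page' folder asked for in the question, and state.txt / soop.txt are the file names already mentioned:

import os

def save_page(url, body, folder='saved_page'):
    # Simplistic file name derived from the URL, for illustration only.
    os.makedirs(folder, exist_ok=True)
    name = url.replace('://', '_').replace('/', '_')
    with open(os.path.join(folder, name), 'wb') as out:
        out.write(body)

# Collect what previous runs already completed.
done = set()
if os.path.exists('state.txt'):
    with open('state.txt') as f:
        done = set(line.strip() for line in f)

with open('state.txt', 'a') as state, open('soop.txt') as urls:
    for url in (line.strip() for line in urls):
        if not url or url + ' written' in done:
            continue                       # finished in a previous run
        body = fetch(url)
        if body is None:
            continue                       # or abort, depending on the chosen policy
        state.write(url + ' fetched\n')
        state.flush()
        save_page(url, body)
        state.write(url + ' written\n')
        state.flush()

On restart, a URL marked 'fetched' but not 'written' tells you the corresponding page may be incomplete on disk.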

Point 3 is much more tricky. To allow parallel downloads, you will have to use threads or asyncio. But you will also have to synchronize all of that to ensure the files are written to the output file in the proper order. If you can afford to keep everything in memory, a simple way would be to first download everything with a parallelized method (the link given by J.F. Sebastian can help), and then write in order, as in the sketch below.
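A sketch of that last approach with a thread pool from the standard library (concurrent.futures); pool.map() returns results in the same order as the input list, so writing them out one by one preserves the order of the URL file. The worker count and the output file name are placeholders, and fetch() is the hypothetical helper from the first sketch:

from concurrent.futures import ThreadPoolExecutor

with open('soop.txt') as f:
    urls = [line.strip() for line in f if line.strip()]

# Download in parallel, then write sequentially: map() yields results in
# input order, holding out-of-order completions in memory until their turn.
with ThreadPoolExecutor(max_workers=20) as pool, open('output.txt', 'wb') as out:
    for body in pool.map(fetch, urls):
        if body is not None:
            out.write(body)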
