I am using Python to crawl web pages iteratively, so I am using 3 HTML files to store the downloaded pages, but somehow these files are not getting overwritten and I still end up with the old content. Here is the code I am using:
def Vals(a,b):
    file1="C:\\Users\\YAS_ayush\\Desktop\\dataset_recommendation\\file1.html"
    file2="C:\\Users\\YAS_ayush\\Desktop\\dataset_recommendation\\file22.html"
    file3="C:\\Users\\YAS_ayush\\Desktop\\dataset_recommendation\\file33.html"
    Query1='"http://scholar.google.com/scholar?q=%22'+a+'%22&btnG=&hl=en&as_sdt=0%2C24"'
    URL1='wget --user-agent Mozilla '+Query1+' -O '+file1
    Query2='"http://scholar.google.com/scholar?q=%22'+b+'%22&btnG=&hl=en&as_sdt=0%2C24"'
    URL2='wget --user-agent Mozilla '+Query2+' -O '+file2
    Query3='"http://scholar.google.com/scholar?q=%22'+a+'%22+%22'+b+'%22&btnG=&hl=en&as_sdt=0%2C24"'
    URL3='wget --user-agent Mozilla '+Query3+' -O '+file3
##    print Query1
##    print Query2
##    print Query3
##
##    print URL1
##    print URL2
##    print URL3
    os.system("wget "+ URL1)
    os.system("wget "+ URL2)
    os.system("wget "+ URL3)
    f1 = open(file1,'r+')
    f2 = open(file2,'r+')
    f3 = open(file3,'r+')
    S1=str(f1.readlines())
    start=S1.find("About")+6
    stop=S1.find("results",start)-1
    try:
        val1=float((S1[start:stop]).replace(",",""))
    except ValueError:
        val1=Reads('C:\\Users\\YAS_ayush\\Desktop\\dataset_recommendation\\file1.html')
    S1=str(f2.readlines())
    #f2.close()
    start=S1.find("About")+6
    stop=S1.find("results",start)-1
    try:
        val2=float((S1[start:stop]).replace(",",""))
    except ValueError:
        val2=Reads('C:\\Users\\YAS_ayush\\Desktop\\dataset_recommendation\\file22.html')
    S1=str(f3.readlines())
    #f3.close()
    start=S1.find("About")+6
    stop=S1.find("results",start)-1
    try:
        val3=float((S1[start:stop]).replace(",",""))
    except ValueError:
        val3=Reads('C:\\Users\\YAS_ayush\\Desktop\\dataset_recommendation\\file33.html')
    f1.truncate()
    f2.truncate()
    f3.truncate()
    f1.close()
    f2.close()
    f3.close()
    return (val1,val2,val3)
Can anyone tell me whether there is an error in how I am closing the files, or how I should close them for my purpose?
Thanks
You're using the -O (capital O) option, which concatenates everything into one file.
'-O file' '--output-document=file'
The documents will not be written to the appropriate files, but all will be concatenated together and written to file. If '-' is used as file, documents will be printed to standard output, disabling link conversion. (Use './-' to print to a file literally named '-'.) Use of '-O' is not intended to mean simply “use the name file instead of the one in the URL;” rather, it is analogous to shell redirection:
wget -O file http://foo
is intended to work like wget -O - http://foo > file
; file will be truncated immediately, and all downloaded content will be written there. For this reason, '-N' (for timestamp-checking) is not supported in combination with '-O': since file is always newly created, it will always have a very new timestamp. A warning will be issued if this combination is used.
Similarly, using '-r' or '-p' with '-O' may not work as you expect: Wget won't just download the first file to file and then download the rest to their normal names: all downloaded content will be placed in file. This was disabled in version 1.11, but has been reinstated (with a warning) in 1.11.2, as there are some cases where this behavior can actually have some use.
Note that a combination with '-k' is only permitted when downloading a single document, as in that case it will just convert all relative URIs to external ones; '-k' makes no sense for multiple URIs when they're all being downloaded to a single file; '-k' can be used only when the output is a regular file.
This snippet was taken from wget's manual.
Hope this helps.
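One more thing worth noting: the URL1/URL2/URL3 strings in the question already begin with wget, so os.system("wget " + URL1) actually runs wget wget --user-agent Mozilla ..., and wget treats the literal word "wget" as an extra URL on the command line, which interacts badly with -O as described above. A minimal sketch of one way to avoid both problems, building a clean argv list per download and running it with subprocess instead of os.system (the build_wget_cmd/fetch names and the placeholder URL here are just illustrative, not from the original code):

```python
import subprocess

def build_wget_cmd(query_url, out_file):
    # One command per download: exactly one URL, one -O target.
    # -O truncates out_file before writing, so no stale content survives.
    return ["wget", "--user-agent", "Mozilla", query_url, "-O", out_file]

def fetch(query_url, out_file):
    # check=True raises CalledProcessError if wget exits non-zero.
    subprocess.run(build_wget_cmd(query_url, out_file), check=True)

# Example command for the first file (placeholder URL and path):
cmd1 = build_wget_cmd("http://scholar.google.com/scholar?q=%22term%22",
                      "file1.html")
```

Passing a list to subprocess.run means there is no shell quoting to get wrong, and each command contains exactly one wget and one URL.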