
File overwrite in Python

I am using Python to crawl web pages iteratively, storing the pages in 3 HTML files, but somehow these files are not getting overwritten and I keep getting the old files back. Here is the code I am using:

def Vals(a,b):
    file1="C:\\Users\\YAS_ayush\\Desktop\\dataset_recommendation\\file1.html"
    file2="C:\\Users\\YAS_ayush\\Desktop\\dataset_recommendation\\file22.html"
    file3="C:\\Users\\YAS_ayush\\Desktop\\dataset_recommendation\\file33.html"

    Query1='"http://scholar.google.com/scholar?q=%22'+a+'%22&btnG=&hl=en&as_sdt=0%2C24"'
    URL1='wget --user-agent Mozilla '+Query1+' -O '+file1

    Query2='"http://scholar.google.com/scholar?q=%22'+b+'%22&btnG=&hl=en&as_sdt=0%2C24"'
    URL2='wget --user-agent Mozilla '+Query2+' -O '+file2

    Query3='"http://scholar.google.com/scholar?q=%22'+a+'%22+%22'+b+'%22&btnG=&hl=en&as_sdt=0%2C24"'
    URL3='wget --user-agent Mozilla '+Query3+' -O '+file3
##    print Query1
##    print Query2
##    print Query3
##    
##    print URL1
##    print URL2
##    print URL3


    os.system("wget "+ URL1)
    os.system("wget "+ URL2)
    os.system("wget "+ URL3)

    f1 = open(file1,'r+')
    f2 = open(file2,'r+')
    f3 = open(file3,'r+')

    S1=str(f1.readlines())

    start=S1.find("About")+6
    stop=S1.find("results",start)-1
    try:
        val1=float((S1[start:stop]).replace(",",""))
    except ValueError:
        val1=Reads('C:\\Users\\YAS_ayush\\Desktop\\dataset_recommendation\\file1.html')

    S1=str(f2.readlines())
    #f2.close()
    start=S1.find("About")+6
    stop=S1.find("results",start)-1

    try:
        val2=float((S1[start:stop]).replace(",",""))
    except ValueError:
        val2=Reads('C:\\Users\\YAS_ayush\\Desktop\\dataset_recommendation\\file22.html')

    S1=str(f3.readlines())
    #f3.close()
    start=S1.find("About")+6
    stop=S1.find("results",start)-1
    try:
        val3=float((S1[start:stop]).replace(",",""))
    except ValueError:
        val3=Reads('C:\\Users\\YAS_ayush\\Desktop\\dataset_recommendation\\file33.html')
    f1.truncate()
    f2.truncate()
    f3.truncate()
    f1.close()
    f2.close()
    f3.close()
    return (val1,val2,val3)

Can anyone tell me whether there is an error in how I am closing the files, or how I should close them for my purpose?

Thanks

You're using the -O (capital O) option, which concatenates everything into one file.
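One thing worth checking in the code above: URL1, URL2 and URL3 already begin with "wget --user-agent Mozilla ...", so os.system("wget " + URL1) passes an extra literal "wget" argument in front of the real query URL, while the single -O sends whatever that one invocation downloads into the same output file. A minimal sketch of a cleaner invocation, assuming wget is on the PATH and substituting subprocess for os.system (the fetch_to helper and the placeholder query term are mine, not from the original code):

import subprocess

def fetch_to(query_url, out_path):
    # Hypothetical helper: one wget invocation per output file.
    # Passing the arguments as a list avoids shell quoting, and -O
    # truncates out_path before this download's content is written into it.
    subprocess.call(["wget", "--user-agent", "Mozilla", query_url, "-O", out_path])

# Placeholder query term and path for illustration; build the real query from a and b as before.
fetch_to("http://scholar.google.com/scholar?q=%22term%22&btnG=&hl=en&as_sdt=0%2C24",
         "C:\\Users\\YAS_ayush\\Desktop\\dataset_recommendation\\file1.html")

The wget manual describes what -O actually does: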

'-O file' '--output-document=file'

The documents will not be written to the appropriate files, but all will be concatenated together and written to file. If '-' is used as file, documents will be printed to standard output, disabling link conversion. (Use './-' to print to a file literally named '-'.) Use of '-O' is not intended to mean simply “use the name file instead of the one in the URL;” rather, it is analogous to shell redirection: wget -O file http://foo is intended to work like wget -O - http://foo > file ; file will be truncated immediately, and all downloaded content will be written there.

For this reason, '-N' (for timestamp-checking) is not supported in combination with '-O': since file is always newly created, it will always have a very new timestamp. A warning will be issued if this combination is used.

Similarly, using '-r' or '-p' with '-O' may not work as you expect: Wget won't just download the first file to file and then download the rest to their normal names: all downloaded content will be placed in file. This was disabled in version 1.11, but has been reinstated (with a warning) in 1.11.2, as there are some cases where this behavior can actually have some use.

Note that a combination with '-k' is only permitted when downloading a single document, as in that case it will just convert all relative URIs to external ones; '-k' makes no sense for multiple URIs when they're all being downloaded to a single file; '-k' can be used only when the output is a regular file.

This snippet was taken from wget's manual.
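In other words, wget -O file URL behaves like wget -O - URL > file: the output file is created and truncated up front, and everything that single invocation downloads is written into it. A rough Python equivalent of the redirection form, again assuming wget is on the PATH (an illustration, not part of the original answer):

import subprocess

url = "http://example.com/"   # placeholder URL
out_path = "page.html"        # placeholder output file

# wget -O - writes the downloaded document to stdout; saving that stream
# ourselves is what -O out_path would otherwise do in a single step.
result = subprocess.run(["wget", "-O", "-", url], capture_output=True)
with open(out_path, "wb") as f:   # "wb" truncates the file before writing
    f.write(result.stdout)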

Hope this helps.
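As a side note, if shelling out to wget is not a hard requirement, the pages can also be fetched directly from Python; opening the output file in "wb" mode truncates it on every run, so content from a previous crawl cannot survive into the next one. A minimal sketch, assuming Python 3 and that a plain urllib request with a Mozilla user agent is acceptable for these pages (the save_page helper is hypothetical, not from the original post):

import urllib.request

def save_page(url, out_path):
    # Hypothetical helper: fetch one page and overwrite out_path with it.
    req = urllib.request.Request(url, headers={"User-Agent": "Mozilla"})
    with urllib.request.urlopen(req) as resp, open(out_path, "wb") as f:
        f.write(resp.read())   # "wb" truncated the file when it was opened

save_page("http://scholar.google.com/scholar?q=%22term%22&btnG=&hl=en&as_sdt=0%2C24",
          "C:\\Users\\YAS_ayush\\Desktop\\dataset_recommendation\\file1.html")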
