
Downloading a .csv file from the web (with redirects) in python

Let me start by saying that I know there are a few topics discussing problems similar to mine, but the suggested solutions do not seem to work for me for some reason. Also, I am new to downloading files from the internet using scripts. Up until now I have mostly used python as a Matlab replacement (using numpy/scipy).

My goal: I want to download a lot of .csv files from an internet database ( http://dna.korea.ac.kr/vhot/ ) automatically using python. I want to do this because it is too cumbersome to download the 1000+ csv files I require by hand. The database can only be accessed using a UI, where you have to select several options from a drop down menu to finally end up with links to .csv files after some steps. I have figured out that the url you get after filling out the drop down menus and pressing 'search' contains all the parameters of the drop-down menu. This means I can just change those instead of using the drop down menu, which helps a lot.

An example url from this website is (let's call it url1):

url1 = http://dna.korea.ac.kr/vhot/search.php?species=Human&selector=drop&mirname=&mirname_drop=hbv-miR-B2RC&pita=on&set=and&miranda_th=-5&rh_th=-10&ts_th=0&mt_th=7.3&pt_th=99999&gene=
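To illustrate the parameter-changing approach, here is a minimal sketch (my own, not from the site) that fills in the query parameters of url1 programmatically with the requests library; the list of miRNA names to loop over is a hypothetical placeholder:

import requests

SEARCH_URL = 'http://dna.korea.ac.kr/vhot/search.php'

# Fixed query parameters copied from url1
base_params = {
    'species': 'Human', 'selector': 'drop', 'mirname': '',
    'pita': 'on', 'set': 'and', 'miranda_th': '-5', 'rh_th': '-10',
    'ts_th': '0', 'mt_th': '7.3', 'pt_th': '99999', 'gene': '',
}

# Hypothetical list of miRNA names to loop over
for mirname in ['hbv-miR-B2RC']:
    params = dict(base_params, mirname_drop=mirname)
    page = requests.get(SEARCH_URL, params=params)
    # page.text now holds the HTML of the results page for this miRNA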

On this page I can select 5 csv files; one of them directs me to the following url (let's call it url2):

url2 = http://dna.korea.ac.kr/vhot/download.php?mirname=hbv-miR-B2RC&species_filter=species_id+%3D+9606&set=and&gene_filter=&method=pita&m_th=-5&rh_th=-10&ts_th=0&mt_th=7.3&pt_th=99999&targetscan=&miranda=&rnahybrid=&microt=&pita=on

However, this doesn't contain the csv file directly, but appears to be a 'redirect' (a new term for me that I found by googling, so correct me if I am wrong).

One strange thing: I appear to have to load url1 in my browser before I can access url2 (I do not know how long this lasts; url2 didn't work for me today even though it did yesterday, and only after accessing url1 did it work again). If I do not access url1 before url2, my browser shows "no results" instead of the csv file. Does anyone know what is going on here?
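(Purely as a guess about the above: if the server keeps per-session state, e.g. a cookie set by search.php, then visiting url1 and then url2 inside one requests.Session might reproduce what the browser does. This is an assumption, not something I have confirmed.)

import requests

url1 = 'http://dna.korea.ac.kr/vhot/search.php?species=Human&selector=drop&mirname=&mirname_drop=hbv-miR-B2RC&pita=on&set=and&miranda_th=-5&rh_th=-10&ts_th=0&mt_th=7.3&pt_th=99999&gene='
url2 = 'http://dna.korea.ac.kr/vhot/download.php?mirname=hbv-miR-B2RC&species_filter=species_id+%3D+9606&set=and&gene_filter=&method=pita&m_th=-5&rh_th=-10&ts_th=0&mt_th=7.3&pt_th=99999&targetscan=&miranda=&rnahybrid=&microt=&pita=on'

session = requests.Session()
session.get(url1)             # visit the search page first, keeping any cookies
response = session.get(url2)  # then request the download page in the same session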

However, my main problem is that I cannot save the csv files from python. I have tried the packages urllib, urllib2 and requests, but I cannot get it to work. From what I understand, the requests package should take care of redirects, but I haven't been able to make it work.

The solutions from the following web pages do not appear to work for me (or I am messing up):

stackoverflow.com/questions/7603044/how-to-download-a-file-returned-indirectly-from-html-form-submission-pyt

stackoverflow.com/questions/9419162/python-download-returned-zip-file-from-url

techniqal.com/blog/2008/07/31/python-file-read-write-with-urllib2/

Some of the things I have tried include:

import urllib2
import csv
import sys

url = 'http://dna.korea.ac.kr/vhot/download.php?mirname=hbv-miR-B2RC&species_filter=species_id+%3D+9606&set=or&gene_filter=&method=targetscan&m_th=-5&rh_th=-10&ts_th=0&mt_th=7.3&pt_th=-10&targetscan=on&miranda=&rnahybrid=&microt=&pita='

#1
u = urllib2.urlopen(url)
localFile = open('file.csv', 'w')
localFile.write(u.read())
localFile.close()

#2
req = urllib2.Request(url)
res = urllib2.urlopen(req)
finalurl = res.geturl()
pass
# finalurl = 'http://dna.korea.ac.kr/vhot/download.php?mirname=hbv-miR-B2RC&species_filter=species_id+%3D+9606&set=or&gene_filter=&method=targetscan&m_th=-5&rh_th=-10&ts_th=0&mt_th=7.3&pt_th=-10&targetscan=on&miranda=&rnahybrid=&microt=&pita='

#3
import requests
r = requests.get(url)
r.content
pass
#r.content = "<script>location.replace('download_send.php?name=qgN9Th&type=targetscan');</script>"

#4
import requests
r = requests.get(url,
                 allow_redirects=True,
                 data={'download_open': 'Download', 'format_open': '.csv'})
print r.content
# r.content = "

#5
import urllib
test1 = urllib.urlretrieve(url, "test.csv")
test2 = urllib.urlopen(url)
pass

For #2, #3 and #4 the outputs are shown after the code as comments. For #1 and #5 I just get a .csv file containing </script>.

Option #3 just gives me a new redirect I think, can this help me?

Can anybody help me with my problem?

The page does not send an HTTP redirect; instead the redirect is done via JavaScript. urllib and requests do not process JavaScript, so they cannot follow the redirect to the download URL. You have to extract the final download URL yourself and then open it using any of those methods.

You could extract the URL using the re module with a regex like r'location\.replace\((.*?)\)'

Based on the answer from ch3ka, I think I got it to work. From the page source I extract the JavaScript redirect, and from this redirect I can get the data.

#Find the source code of the download page (url2 above)
import re
import requests

redirect = requests.get(url).content

#Search for the JavaScript redirect in the source code
# --> based on answer from ch3ka
m = re.search(r"location\.replace\('(.*?)'\)", redirect).group(1)

#Build the real download url from the redirect target, then fetch the data
new_url = 'http://dna.korea.ac.kr/vhot/' + m
data = requests.get(new_url).content
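For completeness, here is a self-contained version of the above as a sketch; it resolves the relative redirect target against url2 with urljoin and writes the result to disk (the output filename result.csv and the fallback message are my own choices):

import re
import requests
try:
    from urlparse import urljoin        # Python 2
except ImportError:
    from urllib.parse import urljoin    # Python 3

url = 'http://dna.korea.ac.kr/vhot/download.php?mirname=hbv-miR-B2RC&species_filter=species_id+%3D+9606&set=and&gene_filter=&method=pita&m_th=-5&rh_th=-10&ts_th=0&mt_th=7.3&pt_th=99999&targetscan=&miranda=&rnahybrid=&microt=&pita=on'

page = requests.get(url).text
match = re.search(r"location\.replace\('(.*?)'\)", page)
if match:
    # e.g. download_send.php?name=...&type=..., relative to the /vhot/ directory
    csv_url = urljoin(url, match.group(1))
    with open('result.csv', 'wb') as f:
        f.write(requests.get(csv_url).content)
else:
    print('No JavaScript redirect found; did you visit the search page (url1) first?')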
