Scrape JavaScript download links from ASP website

I am trying to download all the files from this website for backup and mirroring; however, I don't know how to go about parsing the JavaScript links correctly.

I need to organize all the downloads in the same way in named folders. For example, for the first one I would have a folder named "DAP-1150", and inside that a folder named "DAP-1150 A1 FW v1.10" containing the file "DAP1150A1_FW110b04_FOSS.zip", and so on for each file. I tried using BeautifulSoup in Python, but it didn't seem to be able to handle ASP links properly.

When you struggle with JavaScript links you can give Selenium a try: http://selenium-python.readthedocs.org/en/latest/getting-started.html

from selenium import webdriver
import time

driver = webdriver.Firefox()
driver.get("http://www.python.org")
time.sleep(3)   # Give your Selenium some time to load the page
link_elements = driver.find_elements_by_tag_name('a')
links = [link.get_attribute('href') for link in link_elements]

You can use the links and pass them to urllib2 to download them accordingly. If you need more than a script, I can recommend a combination of Scrapy and Selenium: selenium with scrapy for dynamic page
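For instance, here is a minimal download sketch in Python 2 (to match the urllib2 suggestion), assuming the links list from the snippet above; the download() helper, the 'mirror' folder name, and the .zip filter are illustrative assumptions, not part of the original answer:

import os
import time
import urllib2

def download(url, dest_dir):
    # Save one URL into dest_dir, creating the folder if needed
    if not os.path.isdir(dest_dir):
        os.makedirs(dest_dir)
    filename = url.rsplit('/', 1)[-1]
    response = urllib2.urlopen(url)
    with open(os.path.join(dest_dir, filename), 'wb') as f:
        f.write(response.read())

for url in links:
    if url and url.endswith('.zip'):   # only grab the firmware archives
        download(url, 'mirror')        # hypothetical target folder
        time.sleep(5)                  # be polite to the remote server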

Here's what it is doing. I just used the standard Network inspector in Firefox to snapshot the POST operation. Bear in mind, as with the other answer I pointed you to, this is not a particularly well-written website - JS/POST should not have been used at all.

First of all, here's the JS - it's very simple:

function oMd(pModel_, sModel_) {
    obj = document.form1;
    obj.ModelCategory_.value = pModel_;
    obj.ModelSno_.value = sModel_;
    obj.Model_Sno.value = '';
    obj.ModelVer.value = '';
    obj.action = 'downloads2008detail.asp';
    obj.submit();
}

That writes to these fields:

<input type=hidden name=ModelCategory_ value=''>
<input type=hidden name=ModelSno_ value=''>

So, you just need a POST form targeting this URL:

http://tsd.dlink.com.tw/downloads2008detail.asp

And here's an example set of data from FF's network analyser. There are only two items you need to change - grabbed from the JS link - and you can get those with an ordinary scrape (a sketch reproducing this request follows the list):

  • Enter=OK
  • ModelCategory=0
  • ModelSno=0
  • ModelCategory_=DAP
  • ModelSno_=1150
  • Model_Sno=
  • ModelVer=
  • sel_PageNo=1
  • OS=GPL
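
As a rough sketch, that request can be reproduced with urllib2 (a Python 2 assumption, kept consistent with the first answer; the requests library would work equally well). The two scraped values are marked in comments:

import urllib
import urllib2

# Field values copied from the network inspector snapshot above;
# only the two marked entries come from the scraped oMd(...) call.
form_data = {
    'Enter': 'OK',
    'ModelCategory': '0',
    'ModelSno': '0',
    'ModelCategory_': 'DAP',   # first argument to oMd()
    'ModelSno_': '1150',       # second argument to oMd()
    'Model_Sno': '',
    'ModelVer': '',
    'sel_PageNo': '1',
    'OS': 'GPL',
}

request = urllib2.Request('http://tsd.dlink.com.tw/downloads2008detail.asp',
                          data=urllib.urlencode(form_data))
html = urllib2.urlopen(request).read()   # the detail page listing the files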

You'll probably find by experimentation that not all of them are necessary. I did try using GET for this, in the browser, but it looks like the target page insists upon POST.

Don't forget to leave a decent amount of time inside your scraper between clicks and submits, as each one represents a hit on the remote server; I suggest 5 seconds, emulating a human delay. If you do this too quickly - all too possible if you are on a good connection - the remote side may assume you are DoSing them, and might block your IP. Remember the motto of scraping: be a good robot!
