Scrape JavaScript download links from ASP website

I am trying to download all the files from this website for backup and mirroring; however, I don't know how to go about parsing the JavaScript links correctly.

I need to organize all the downloads in the same way in named folders. For example, for the first one I would have a folder named "DAP-1150", and inside that a folder named "DAP-1150 A1 FW v1.10" containing the file "DAP1150A1_FW110b04_FOSS.zip", and so on for each file. I tried using BeautifulSoup in Python, but it didn't seem to be able to handle ASP links properly.

When you struggle with JavaScript links you can give Selenium a try: http://selenium-python.readthedocs.org/en/latest/getting-started.html

from selenium import webdriver
import time

driver = webdriver.Firefox()
driver.get("http://www.python.org")
time.sleep(3)   # Give your Selenium some time to load the page
link_elements = driver.find_elements_by_tag_name('a')
links = [link.get_attribute('href') for link in link_elements]

You can use the links and pass them to urllib2 to download them accordingly. If you need more than a script, I can recommend a combination of Scrapy and Selenium: selenium with scrapy for dynamic page
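For instance, here is a minimal download sketch in Python 2 (to match the urllib2 suggestion), assuming the links list from the snippet above; the download() helper, the 'mirror' folder name, and the .zip filter are illustrative assumptions, not part of the original answer:

import os
import time
import urllib2

def download(url, dest_dir):
    # Save one URL into dest_dir, creating the folder if needed
    if not os.path.isdir(dest_dir):
        os.makedirs(dest_dir)
    filename = url.rsplit('/', 1)[-1]
    response = urllib2.urlopen(url)
    with open(os.path.join(dest_dir, filename), 'wb') as f:
        f.write(response.read())

for url in links:
    if url and url.endswith('.zip'):   # only grab the firmware archives
        download(url, 'mirror')        # hypothetical target folder
        time.sleep(5)                  # be polite to the remote server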

Here's what it is doing. I just used the standard Network inspector in Firefox to snapshot the POST operation. Bear in mind, as with the other answer I pointed you to, this is not a particularly well-written website - JS/POST should not have been used at all.

First of all, here's the JS - it's very simple:

function oMd(pModel_, sModel_) {
    obj = document.form1;
    obj.ModelCategory_.value = pModel_;
    obj.ModelSno_.value = sModel_;
    obj.Model_Sno.value = '';
    obj.ModelVer.value = '';
    obj.action = 'downloads2008detail.asp';
    obj.submit();
}

That writes to these fields:

<input type=hidden name=ModelCategory_ value=''>
<input type=hidden name=ModelSno_ value=''>

So, you just need a POST form targeting this URL:

http://tsd.dlink.com.tw/downloads2008detail.asp

And here's an example set of data from FF's network analyser. There are only two items you need to change - grabbed from the JS link - and you can get those with an ordinary scrape (a sketch reproducing this request follows the list):

  • Enter=OK
  • ModelCategory=0
  • ModelSno=0
  • ModelCategory_=DAP
  • ModelSno_=1150
  • Model_Sno=
  • ModelVer=
  • sel_PageNo=1
  • OS=GPL
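
As a rough sketch, that request can be reproduced with urllib2 (a Python 2 assumption, kept consistent with the first answer; the requests library would work equally well). The two scraped values are marked in comments:

import urllib
import urllib2

# Field values copied from the network inspector snapshot above;
# only the two marked entries come from the scraped oMd(...) call.
form_data = {
    'Enter': 'OK',
    'ModelCategory': '0',
    'ModelSno': '0',
    'ModelCategory_': 'DAP',   # first argument to oMd()
    'ModelSno_': '1150',       # second argument to oMd()
    'Model_Sno': '',
    'ModelVer': '',
    'sel_PageNo': '1',
    'OS': 'GPL',
}

request = urllib2.Request('http://tsd.dlink.com.tw/downloads2008detail.asp',
                          data=urllib.urlencode(form_data))
html = urllib2.urlopen(request).read()   # the detail page listing the files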

You'll probably find by experimentation that not all of them are necessary. I did try using GET for this, in the browser, but it looks like the target page insists upon POST.

Don't forget to leave a decent amount of time inside your scraper between clicks and submits, as each one represents a hit on the remote server; I suggest 5 seconds, emulating a human delay. If you do this too quickly - all too possible if you are on a good connection - the remote side may assume you are DoSing them, and might block your IP. Remember the motto of scraping: be a good robot!
