
Python site crawler, saving files with Scrapy

I'm attempting to write a crawler that will take a certain search entry and save a whole bunch of .CSV files correlated to the results.

I already have the spider logging in and parsing all the HTML data I need; now all I have left to do is figure out how I can save the files I need.

So the search returns links such as this: https://www.thissite.com/data/file_download.jsp?filetype=1&id=22944

In a web browser, that link then prompts you to save the correlated .csv file. How can I write my spider to be able to load this page and download the file? Or is there a way I can catch a static link to the information?

If you crawled the links to the CSV files, you can simply download them with wget, which is able to log in to a page too.

You either specify --http-user and --http-passwd, or you use cookies as follows:

$ wget --cookies=on --keep-session-cookies --save-cookies=cookie.txt --post-data "login=USERNAME&password=PASSWORD" http://first_page
$ wget --referer=http://first_page --cookies=on --load-cookies=cookie.txt --keep-session-cookies --save-cookies=cookie.txt http://second_page
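
If the site uses plain HTTP authentication rather than a login form, the credential flags mentioned above are enough on their own. A minimal sketch, reusing the example URL from the question (note that newer wget releases spell the second flag --http-password):

$ wget --http-user=USERNAME --http-passwd=PASSWORD "https://www.thissite.com/data/file_download.jsp?filetype=1&id=22944"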

It depends on how your site handles logins. There are a few other ways to log in to a page with wget; I'm sure you'll find those by googling.

I'd suggest doing all this in a special Scrapy Pipeline, so it's all done in Scrapy and not in an external script.
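
For example, one way to realise that is Scrapy's built-in FilesPipeline. A minimal sketch, assuming your results page contains the file_download.jsp links and that the built-in pipeline is acceptable as the "special" pipeline; the spider name, callback name and FILES_STORE directory below are placeholders:

# settings.py: enable the built-in files pipeline and pick a download directory
ITEM_PIPELINES = {"scrapy.pipelines.files.FilesPipeline": 1}
FILES_STORE = "downloaded_csvs"

# spider fragment: turn each download link into an item the pipeline picks up
import scrapy

class CsvFileItem(scrapy.Item):
    file_urls = scrapy.Field()   # URLs the FilesPipeline should fetch
    files = scrapy.Field()       # filled in by the pipeline with download results

class ResultsSpider(scrapy.Spider):
    name = "csv_results"
    # ... login and search-form handling as you already have it ...

    def parse_results(self, response):
        # collect every file_download.jsp link on the results page
        for href in response.css('a[href*="file_download.jsp"]::attr(href)').getall():
            yield CsvFileItem(file_urls=[response.urljoin(href)])

The pipeline downloads each URL through Scrapy's own downloader, so the crawl and the file saving stay in one process; if you need control over the saved file names, you can subclass FilesPipeline and override its file_path() method.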
