
How to spider a password protected site in Python?

Currently I have a spider written in Java (using htmlunit) that logs into a supplier website and spiders it.

It keeps the session (cookie) and even lets me enable or disable JavaScript, etc.

I also use htmlparser (Java) to help parse the HTML and extract the relevant information.

Does Python have something similar?

Python has urllib2 for fetching pages; it supports password authentication and cookies.
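As a rough sketch of what that can look like: in Python 3 the urllib2 module became urllib.request, so the example below uses that name. The function names, the example.com URLs, and the form field names are all made up for illustration, and whether a given site wants HTTP Basic auth, a login form, or both is an assumption you would have to check against the real site.

```python
import http.cookiejar
import urllib.parse
import urllib.request


def build_session_opener(base_url, username, password):
    """Return (opener, cookie_jar): the opener answers HTTP Basic auth
    challenges for base_url and keeps cookies between requests."""
    cookie_jar = http.cookiejar.CookieJar()

    # Credentials are only sent when the server challenges for them
    # on URLs under base_url.
    password_mgr = urllib.request.HTTPPasswordMgrWithDefaultRealm()
    password_mgr.add_password(None, base_url, username, password)

    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(cookie_jar),
        urllib.request.HTTPBasicAuthHandler(password_mgr),
    )
    return opener, cookie_jar


def login_with_form(opener, login_url, fields):
    """POST a login form. Any session cookie the server sets is stored
    in the opener's cookie jar and sent on later requests."""
    data = urllib.parse.urlencode(fields).encode("utf-8")
    with opener.open(login_url, data=data, timeout=30) as response:
        return response.read()


# Hypothetical usage (URLs and field names are placeholders):
# opener, jar = build_session_opener("https://supplier.example.com/", "me", "secret")
# login_with_form(opener, "https://supplier.example.com/login",
#                 {"username": "me", "password": "secret"})
# page = opener.open("https://supplier.example.com/catalog").read()
```

Reusing the same opener for every request is what keeps the session alive, much like the Java spider keeping its cookie. Note that nothing here runs JavaScript.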

There is also HTMLParser for extracting data from HTML, but some people prefer the more full-featured BeautifulSoup.
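For the parsing side, here is a minimal sketch using the standard library parser (the module is called html.parser in Python 3). It only collects link targets, which is the part a spider needs to find the next pages; the class name, function name, and URLs are made up for the example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collect the href of every <a> tag, resolved against base_url."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            # Anchors without an href (or with an empty one) are skipped.
            if name == "href" and value:
                self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    """Return absolute URLs for all links found in an HTML string."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    parser.close()
    return parser.links
```

Feeding each fetched page through extract_links and queueing the results you have not visited yet gives you the basic crawl loop. BeautifulSoup would make pulling out the actual data more convenient, but it is a third-party install.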

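There is also Scrapy, a crawling framework that comes with a number of different parsers and helper routines (it does its own downloading on top of Twisted rather than urllib2).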


 