
Find all possible links in a website / Screen-Web Scraping with Python

This is a somewhat open-ended question. I need to go through a job site and search each page for a job description tag and a skill requirement (that part is done). What I want to know is how to crawl across the site, that is, go from test.com to test.com/a and so on.

This is my code to search within a single page. I need to find all such pages on the site and get their links. This is not homework; I'm just doing this on the side.

import urllib2
import re

# Download the raw HTML of a single job posting
html_content = urllib2.urlopen('http://www.ziprecruiter.com/job/Systems-Engineer/b5452eab/?source=customer-cpc-indeed').read()

# Look for the description keyword and the skill keyword
matchDescription = re.findall('Bachelor', html_content)
matchSkill = re.findall('VMware', html_content)

print matchDescription
print matchSkill

# Report success only when both keywords were found
if not (matchDescription and matchSkill):
    print 'I did not find anything'
else:
    print 'My string is in the html'

Consider using Scrapy or another existing scraping framework. Otherwise, you need to extract the links yourself with lxml or another HTML parser, fetch each linked page with urllib or something similar, and keep data structures that hold the URLs still to be visited (the input) and the results collected so far (the output).
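The manual approach can be sketched roughly as follows. This is only an illustration under a few assumptions: it is written for Python 3 (so `urllib.request` instead of the question's `urllib2`), it uses the standard library's `html.parser` in place of lxml to avoid a dependency, and the names `LinkParser`, `extract_links`, `fetch_page` and `crawl` are made up for this example. It does a breadth-first walk that stays on the starting host, with a queue as the "input" structure and a list of visited pages as the "output".

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse
from urllib.request import urlopen


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_links(html, base_url):
    """Return absolute URLs (fragments removed) for the links in html."""
    parser = LinkParser()
    parser.feed(html)
    links = []
    for href in parser.hrefs:
        absolute, _fragment = urldefrag(urljoin(base_url, href))
        links.append(absolute)
    return links


def fetch_page(url):
    """Download a page as text (assumes UTF-8; bad bytes are replaced)."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(start_url, fetch=fetch_page, max_pages=50):
    """Breadth-first crawl limited to the start URL's host.

    Returns the list of pages visited, in visit order.
    """
    host = urlparse(start_url).netloc
    queue = deque([start_url])   # URLs waiting to be fetched
    seen = {start_url}           # URLs already queued, to avoid loops
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        try:
            html = fetch(url)
        except Exception:
            continue  # skip pages that fail to download
        visited.append(url)
        for link in extract_links(html, url):
            parsed = urlparse(link)
            same_site = parsed.scheme in ("http", "https") and parsed.netloc == host
            if same_site and link not in seen:
                seen.add(link)
                queue.append(link)
    return visited
```

Calling `crawl('http://test.com/')` would return the list of pages reached from the start page; the keyword search from the question could then be run on each of them. The `fetch` parameter is there so the download step can be swapped out, for example with a function that adds delays or reads from a cache, and `max_pages` is a simple guard against crawling a very large site.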


 