
How to write a Python script to search a website's HTML for matching links

I am not too familiar with Python and have to write a script that performs a host of functions. The module I still need is one that checks a website's HTML for links matching a list provided beforehand.

Matching what about the links? Their href attribute? The link display text? Perhaps something like:

# Python 2 with BeautifulSoup 3
from BeautifulSoup import BeautifulSoup, SoupStrainer
import re
import urllib2

# Download the page's HTML.
doc = urllib2.urlopen("http://somesite.com").read()
# Parse only <a> tags whose href starts with "test".
links = SoupStrainer('a', href=re.compile(r'^test'))
matches = [str(elm) for elm in BeautifulSoup(doc, parseOnlyThese=links)]
for elm in matches:
    print elm

That will grab the HTML content of somesite.com and then parse it using BeautifulSoup, looking only for links whose HREF attribute starts with "test". It then builds a list of these links and prints them out.
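
For readers on Python 3, the same idea can be sketched with the newer bs4 package (installed as beautifulsoup4). This is only a rough port under that assumption, not the answerer's code: the function names find_links and fetch are made up for illustration, and the built-in "html.parser" backend is just one possible parser choice.

```python
import re
from urllib.request import urlopen

from bs4 import BeautifulSoup, SoupStrainer


def find_links(html, href_pattern):
    """Return the href of every <a> tag whose href matches href_pattern.

    find_links is a hypothetical helper name, not part of bs4.
    """
    # SoupStrainer restricts parsing to <a> tags with a matching href,
    # mirroring parseOnlyThese in the older BeautifulSoup 3 API.
    only_links = SoupStrainer("a", href=re.compile(href_pattern))
    soup = BeautifulSoup(html, "html.parser", parse_only=only_links)
    return [a["href"] for a in soup.find_all("a")]


def fetch(url):
    """Download a page and return its raw bytes (hypothetical helper)."""
    with urlopen(url) as response:
        return response.read()
```

It could then be used as find_links(fetch("http://somesite.com"), r"^test"), where the URL is the same placeholder as in the snippet above. Keeping the download separate from the parsing makes it possible to try the matching logic on a literal HTML string first.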

You can adapt this to your needs using the BeautifulSoup documentation.

Generally, you use urllib, urllib2 (htmllib, etc.) for web programming in Python. You could also use mechanize, curl, etc. Then, for processing the HTML and extracting links, you would want a parser such as BeautifulSoup.
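
If installing a third-party parser is not an option, the fetch-then-parse approach described here might also be sketched with the Python 3 standard library alone. The following assumes the "links provided beforehand" are absolute URLs; LinkCollector, matching_links and check_site are hypothetical names chosen for this example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = set()

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Turn relative links such as "/about" into absolute URLs.
                self.links.add(urljoin(self.base_url, href))


def matching_links(html, base_url, wanted):
    """Return the subset of the wanted URLs that the page links to."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    return set(wanted) & collector.links


def check_site(url, wanted):
    """Download url and report which of the wanted links it contains."""
    with urlopen(url) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        html = response.read().decode(charset, errors="replace")
    return matching_links(html, url, wanted)
```

Calling check_site("http://somesite.com", ["http://somesite.com/about"]) would return the set of listed links that actually appear on the page. Exact string comparison is a simplification; real pages may need normalization of trailing slashes, fragments, or query strings.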

Try Scrapy, a comprehensive web extraction framework.

http://scrapy.org


 