
How to get all links from a PHP page using Python and mechanize

I would like to extract all links from a web page. Here is my code so far.

import mechanize
import lxml.html
from time import sleep

links = list()
visited_links = list()

br = mechanize.Browser()

def findLinks(url):
    response = br.open(url)
    visited_links.append(response.geturl())

    # Snapshot the links first: following a link replaces the browser's
    # current page, which would otherwise disturb the iteration.
    for link in list(br.links()):
        response = br.follow_link(link)
        links.append(response.geturl())
        sleep(1)


findLinks("http://temelelektronik.net")

# Iterate over a copy, because removing items from a list
# while iterating over it skips elements.
for link in links[:]:
    if link in visited_links:
        links.remove(link)
    else:
        findLinks(link)
        print link

for link in visited_links:
    print link

In fact I don't want to write a web crawler. What I'd like to do is extract all links from a web page and create a site map. I also wonder whether it is possible to get the last modification time of a file from the server using mechanize and Python.
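
For the modification time, a server reports it (if at all) in the Last-Modified HTTP response header, which mechanize exposes on the response object. A minimal sketch; note that dynamically generated (e.g. PHP) pages often omit this header:

import mechanize

br = mechanize.Browser()
response = br.open("http://temelelektronik.net")
# info() returns the HTTP response headers; Last-Modified is optional
# and frequently missing for dynamically generated pages.
last_modified = response.info().getheader("Last-Modified")
print last_modified  # e.g. "Wed, 13 Feb 2013 08:00:00 GMT", or None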

While this code snippet works fine for plain HTML pages, it doesn't extract links from PHP pages, for example this page. How can I extract links from PHP pages?

Any help would be appreciated. Thanks.

I don't know mechanize, but I have used the pattern.web module, which has an easy-to-use HTML DOM parser. For a site map, I think this is close to what you are looking for:

from pattern.web import URL, DOM

url = URL("http://temelelektronik.net")
dom = DOM(url.download())
# Each element's tag attributes are exposed as a dictionary.
for anchor in dom.by_tag('a'):
    print(anchor.attributes.get('href'))
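
To turn those hrefs into the absolute URLs a site map needs, they can be resolved against the base URL with the standard library's urljoin. A sketch along the same lines, assuming pattern's Element.attributes dictionary:

from urlparse import urljoin  # urllib.parse on Python 3
from pattern.web import URL, DOM

base = "http://temelelektronik.net"
dom = DOM(URL(base).download())
# Resolve relative hrefs against the base URL and drop duplicates.
sitemap = sorted(set(urljoin(base, a.attributes.get('href'))
                     for a in dom.by_tag('a') if a.attributes.get('href')))
for page in sitemap:
    print(page)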

Here is another solution which uses a web spider to visit each link.

# DEPTH, BREADTH, FIFO, LIFO are crawl-ordering constants
# (not used in this minimal example).
from pattern.web import Spider, DEPTH, BREADTH, FIFO, LIFO

class SimpleSpider1(Spider):

    def visit(self, link, source=None):
        # Called for each page that was successfully downloaded.
        print "visiting:", link.url, "from:", link.referrer

    def fail(self, link):
        # Called for links that could not be retrieved.
        print "failed:", link.url

# Crawl only within the given domain; delay=0.0 disables polite throttling.
spider1 = SimpleSpider1(links=["http://www.temelelektronik.net/"], domains=["temelelektronik.net"], delay=0.0)

print "SPIDER 1 " + "-" * 50
# Keep crawling until five pages have been visited.
while len(spider1.visited) < 5:
    spider1.crawl(cached=False)

The syntax that is specific to Mechanize (note: this is the Ruby Mechanize gem, whose API differs from Python's mechanize module) goes as follows.

agent=Mechanize.new

page=agent.get(URL)

page.links returns an array of all links in the page.

page.links.first.text returns the text (without the href) of the first link.

page.link_with(:text=>"Text").click would return the page that results from clicking the link whose text is "Text".
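
For comparison, rough Python mechanize equivalents of those Ruby calls; a sketch, where "Text" is a placeholder for the link text you want:

import mechanize

br = mechanize.Browser()
br.open("http://www.temelelektronik.net/")

links = list(br.links())   # all links on the page
print links[0].text        # text of the first link

# Like page.link_with(:text => "Text").click in Ruby:
response = br.follow_link(text="Text")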

Hope this helps.
