
Python recursive crawling for URLs

I have this method that, when supplied with a list of links, gets the child links, and so on and so forth:

def crawlSite(self, linksList):
    finalList = []
    for link in list(linksList):
        if link not in finalList:
            print link            
            finalList.append(link)
            childLinks = self.getAllUniqueLinks(link)
            length = len(childLinks)
            print 'Total links for this page: ' + str(length)

        self.crawlSite(childLinks)
    return finalList

It eventually repeats itself with the same set of links, and I can't figure out why. When I move the self.crawlSite(childLinks) call inside the if statement, I get the first item in the list repeated over and over.

Background: the self.getAllUniqueLinks(link) method gets a list of links from a given page, filtered down to all clickable links within a given domain. Basically, what I am trying to do is get all clickable links from a website. If this isn't the right approach, could you recommend a better method that does the same thing? Please also consider that I am fairly new to Python and might not understand more complex approaches, so please explain your thought process, if you don't mind :)

You need

finalList.extend(self.crawlSite(childLinks))

not just

self.crawlSite(childLinks)

You need to merge the list returned by each inner crawlSite() call with the list that already exists in the outer crawlSite(). Even though they're all called finalList, you have a different list in each scope.
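The scoping point can be seen with a stripped-down example that has nothing to do with crawling: each recursive call builds its own local list, and without extend() on the return value the inner lists are simply thrown away (a sketch, not the original crawler code):

```python
def collect(n):
    """Return [n, n-1, ..., 1] by recursion.

    Each call has its OWN local `result`; the only way the caller
    sees the inner call's items is by merging its return value.
    """
    result = [n]
    if n > 1:
        result.extend(collect(n - 1))  # merge the inner call's list
    return result

print(collect(3))  # → [3, 2, 1]
```

If the extend() line were just collect(n - 1), the inner result lists would be built and discarded, and collect(3) would return only [3] — which is exactly what happens to the child links in the original crawlSite().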

The alternative (and better) solution is to make finalList an instance variable (or a nonlocal variable of some sort) instead of just a local variable, so that it's shared by all the recursive crawlSite() calls:

def __init__(self, *args, **kwargs):
    self.finalList = set()

def crawlSite(self, linksList):
    for link in linksList:
        if link not in self.finalList:
            print link            
            self.finalList.add(link)
            childLinks = self.getAllUniqueLinks(link)
            length = len(childLinks)
            print 'Total links for this page: ' + str(length)
            self.crawlSite(childLinks)

You just need to make sure you reset self.finalList = set() if you want to start over from scratch with the same instance.
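As a rough Python 3 sketch of how the shared set behaves, here is the same structure with getAllUniqueLinks() stubbed out by a hard-coded link graph (the graph, class name, and page names are made up for illustration; the real method would fetch and parse a page):

```python
class Crawler:
    # Toy link graph standing in for real page fetching; note the
    # cycles (about -> home, post1 -> blog), which would make a
    # per-call local list recurse forever.
    GRAPH = {
        'home':  ['about', 'blog'],
        'about': ['home'],
        'blog':  ['post1', 'post2'],
        'post1': ['blog'],
        'post2': [],
    }

    def __init__(self):
        self.finalList = set()  # one set shared by every recursive call

    def getAllUniqueLinks(self, link):
        return self.GRAPH.get(link, [])

    def crawlSite(self, linksList):
        for link in linksList:
            if link not in self.finalList:
                self.finalList.add(link)
                self.crawlSite(self.getAllUniqueLinks(link))

crawler = Crawler()
crawler.crawlSite(['home'])
print(sorted(crawler.finalList))
# → ['about', 'blog', 'home', 'post1', 'post2']
```

Because every call checks and updates the same self.finalList, an already-visited page is skipped no matter how deep in the recursion it reappears, so the cycles terminate cleanly.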

Edit: Fixed the code by putting the recursive call inside the if block, and used a set. Also, linksList doesn't need to be a list, just an iterable object, so I removed the list() call from the for loop. The set was suggested by @Ray-Toal.

You are re-creating the finalList list on each recursive call.

What is needed is a more global set of links that you have already visited. Each recursive call should contribute to this shared set; otherwise, if your link graph has cycles, you are sure to eventually visit a site twice.

UPDATE: Check out the nice pattern used in DFS on a graph using a Python generator. Your finalList can be a parameter, and you add to it in each recursive call. Also, FWIW, consider a set rather than a list: membership tests on a set are faster.
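A hedged sketch of that generator pattern (function and variable names are my own, and a toy dictionary again stands in for real page fetching). It uses a None default rather than a literal [] or set() default, because a mutable default argument in Python persists across top-level calls:

```python
def crawl(start, get_links, visited=None):
    """Yield each link reachable from `start` exactly once, depth-first."""
    if visited is None:
        visited = set()          # fresh visited-set per top-level call
    if start in visited:
        return                   # already seen: cut off the cycle here
    visited.add(start)
    yield start
    for child in get_links(start):
        yield from crawl(child, get_links, visited)

# Toy cyclic graph; a real get_links would fetch and parse the page.
graph = {'a': ['b', 'c'], 'b': ['a'], 'c': ['b']}
print(list(crawl('a', graph.get)))   # → ['a', 'b', 'c']
```

The generator lets the caller process each link as it is discovered (or stop early) instead of waiting for the whole crawl to finish building a list.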
