
How to wrap this into a recursive call

I'm writing a little spider in python. It hits a page, gets all the links on the page according to a pattern, goes to each of those pages, and repeats.

I need to make this recursive somehow. Below are the url patterns:

www.example.com

Then I get all links based on a regex, then I visit each page.

Recursive part:

Say I am visiting a page with a url like:

www.example.com/category/1

Now if this page contains links like:

www.example.com/category/1_234

(basically the same url, except it has an additional "_234")

I visit that page, then check for a url like:

www.example.com/category/1_234_4232

(again, same url plus an underscore and a number)

I keep doing this until there are no more links fitting this pattern.

1. Visit a category page.
2. Does it contain links with the same url + "_dddd"? If yes, visit that page.
3. Go back to #2 until there are no more such links.

I don't need the regex; I just need help with structuring the recursive call.

Plain recursion, in the manner you are asking for, might not be the best approach.

For one thing, if there are more than 1000 pages to be indexed, you will blow past Python's default recursion limit and crash. For another, if you are not careful about releasing variables during recursion, it can end up consuming quite a bit of memory.
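As a point of reference, CPython's default recursion limit is 1000 frames; you can check it with the sys module (raising it is possible, but an iterative loop is usually the better fix):

import sys

print(sys.getrecursionlimit())   # 1000 by default in CPython
# sys.setrecursionlimit(5000)    # possible, but only postpones the problem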

I would recommend something more like:

visit_next = set(['www.example.com'])   # links still to be crawled
visited = set()                         # links already crawled

while visit_next:
    next_link = visit_next.pop()
    if next_link not in visited:
        visited.add(next_link)
        for link in find_all_links_from(next_link):
            if link not in visited:
                visit_next.add(link)
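find_all_links_from is just a placeholder above; a minimal sketch of it using only the standard library (the fetching code and the href regex are assumptions - swap in your own pattern) might look like:

import re
import urllib.request

LINK_RE = re.compile(r'href="(http[^"]+)"')   # stand-in pattern; use your own regex here

def find_all_links_from(url):
    # download the page and return everything that looks like a link
    # note: urlopen needs a full url including the scheme, e.g. "http://..."
    with urllib.request.urlopen(url) as response:
        html = response.read().decode('utf-8', errors='replace')
    return LINK_RE.findall(html)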

Edit:

As suggested, I have rewritten this using sets; this should use less memory (on a long traversal, the visit_next list would almost certainly have collected many duplicate links).

Basically, you'd do it as others have outlined (with flaws):

def visit(url):
    content = get_content(url)       # placeholder: download the page
    links = extract_links(content)   # placeholder: pull the links out of the page
    for link in links:
        visit(link)                  # flaw: nothing stops this from revisiting the same page forever

However (as also pointed out), you have to keep track of pages you have already visited. A naive (but reasonably effective) approach is to store the visited URLs in a set (perhaps after "normalizing" them to catch practically identical links that are written differently - for instance, by stripping the /#.*$/ fragment). Example:

visited = set()

def visit(url):
    content = get_content(url)
    links = extract_and_normalize_links(content)
    for link in links:
        if link not in visited:
            visited.add(link)   # mark the link as visited before recursing so it is not followed again
            visit(link)
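For the normalization part, stripping the /#.*$/ fragment is a one-liner with urllib.parse; a minimal sketch (extract_and_normalize_links and get_content stay placeholders you would fill in):

from urllib.parse import urldefrag

def normalize(url):
    # drop the "#fragment" so practically identical links compare equal
    url, _fragment = urldefrag(url)
    return url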

Stack overflows could become an issue depending on how many links there are (on the other hand, if you reach 1000 recursive calls, the script will be running for a very long time anyway). If you do have that many levels of indirection, you should remove the recursion, as shown in another answer. Or use a stack - a list - if you care about scraping in the same order as you would with recursion:

while link_stack:
    current_page = link_stack.pop()
    # get links
    link_stack.extend(links_from_current_page)
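Putting the visited set and the stack together, a complete iterative version might look something like this (get_content and extract_and_normalize_links are the same placeholders as above):

link_stack = ['www.example.com']    # seed with the start url
visited = set()

while link_stack:
    current_page = link_stack.pop()   # pop() takes the newest link first, roughly mirroring the recursion order
    if current_page in visited:
        continue
    visited.add(current_page)
    content = get_content(current_page)
    for link in extract_and_normalize_links(content):
        if link not in visited:
            link_stack.append(link)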
