
Recursive function gives no output

I'm scraping all the URLs of my domain with a recursive function, but it outputs nothing, without any error.

#!/usr/bin/python

from bs4 import BeautifulSoup
import requests
import tldextract


def scrape(url):

    for links in url:
        main_domain = tldextract.extract(links)
        r = requests.get(links)
        data = r.text
        soup = BeautifulSoup(data)
    
        for href in soup.find_all('a'):
            href = href.get('href')
            if not href:
                continue
            link_domain = tldextract.extract(href)
        
            if link_domain.domain == main_domain.domain :
                problem.append(href)
    
            elif not href == '#' and link_domain.tld == '':
                new = 'http://www.'+ main_domain.domain + '.' + main_domain.tld + '/' + href
                problem.append(new)

        return len(problem)
        return scrape(problem)
        

problem = ["http://xyzdomain.com"]  
print(scrape(problem))

When I create a new list it works, but I don't want to create a new list for every loop.

You need to structure your code so that it meets the pattern for recursion, which your current code doesn't. You also shouldn't rebind a name to something different, e.g. href = href.get('href') replaces the tag with its attribute string (the same goes for giving variables the same names as libraries, which stops the library working because the name then refers to your variable). As it currently stands, your code will only ever return the len(), because that return is unconditionally reached before return scrape(problem):

def Recursive(Factorable_problem):
    if Factorable_problem is Simplest_Case:
        return AnswerToSimplestCase
    else:
        return Rule_For_Generating_From_Simpler_Case(Recursive(Simpler_Case))

for example:

def Factorial(n):
    """ Recursively Generate Factorials """
    if n < 2:
        return 1
    else:
        return n * Factorial(n-1)
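
Applied to the original scraping problem, the same pattern might look something like the sketch below. This is only a rough illustration, not the poster's code: the visited set, the scrape_recursive name and the urljoin call for resolving relative links are all assumptions added here so that the recursion has a base case and terminates.

from urllib.parse import urljoin

from bs4 import BeautifulSoup
import requests
import tldextract


def scrape_recursive(urls, visited=None):
    """Recursively collect links on the same domain until no new ones appear."""
    if visited is None:
        visited = set()
    new_links = []
    for url in urls:
        if url in visited:
            continue
        visited.add(url)
        main_domain = tldextract.extract(url)
        soup = BeautifulSoup(requests.get(url).text, "html.parser")
        for a in soup.find_all('a'):
            href = a.get('href')
            if not href or href == '#':
                continue
            href = urljoin(url, href)  # turn relative links into absolute ones (assumption, not in the original)
            if tldextract.extract(href).domain == main_domain.domain and href not in visited:
                new_links.append(href)
    if not new_links:  # simplest case: nothing new left to visit
        return visited
    return scrape_recursive(new_links, visited)  # recurse on the smaller remaining problem


print(len(scrape_recursive(["http://xyzdomain.com"])))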

Hello, I've made a non-recursive version of this that appears to get all the links on the same domain.

I've tested the code below using the problem list included in the code. Once I'd solved the problems with the recursive version, the next problem was hitting the recursion depth limit, so I rewrote it to run in an iterative fashion; the code and result are below:

from bs4 import BeautifulSoup
import requests
import tldextract


def print_domain_info(d):
    print("Main Domain: {0}\nSub Domain: {1}\nSuffix: {2}".format(d.domain, d.subdomain, d.suffix))

SEARCHED_URLS = []
problem = ["http://Noelkd.neocities.org/", "http://youpi.neocities.org/"]
while problem:
    # Get a link from the stack of links
    link = problem.pop()
    # Check we haven't been to this address before
    if link in SEARCHED_URLS:
        continue
    # We don't want to come back here again after this point
    SEARCHED_URLS.append(link)
    # Try and get the website
    try:
        req = requests.get(link)
    except requests.exceptions.RequestException:
        # If it's not working I don't care for it
        print("borked website found: {0}".format(link))
        continue
    # Now we've got to this point it's worth printing something
    print("Trying to parse: {0}".format(link))
    print("Status Code: {0}  That's: {1}".format(req.status_code, "A-OK" if req.status_code == 200 else "SOMETHING'S UP"))
    # Get the domain info
    dInfo = tldextract.extract(link)
    print_domain_info(dInfo)
    # I like utf-8
    data = req.text.encode("utf-8")
    print("Length Of Data Retrieved: {0}".format(len(data)))  # More info
    soup = BeautifulSoup(data, "html.parser")  # This was here before so I left it.
    print("Found {0} link{1}".format(len(soup.find_all('a')), "s" if len(soup.find_all('a')) > 1 else ""))
    FOUND_THIS_ITERATION = []  # Getting the same links over and over was boring
    found_links = [x for x in soup.find_all('a') if x.get('href') not in SEARCHED_URLS]  # Find me all the links I haven't got
    for href in found_links:
        href = href.get('href')  # You wrote this; it seems to work well
        if not href:
            continue
        link_domain = tldextract.extract(href)
        if link_domain.domain == dInfo.domain:  # JUST FINDING STUFF ON THE SAME DOMAIN, RIGHT?!
            if href not in FOUND_THIS_ITERATION:  # I'ma check you out next time
                print("Check out this link: {0}".format(href))
                print_domain_info(link_domain)
                FOUND_THIS_ITERATION.append(href)
                problem.append(href)
            else:  # I got you already
                print("DUPE LINK!")
        else:
            print("Not on same domain, moving on")

    # Count down
    print("We have {0} more sites to search".format(len(problem)))
    if problem:
        continue
    else:
        print("It's been fun")
        print("Let's see the URLs we've visited:")
        for url in SEARCHED_URLS:
            print(url)

Which prints, after a lot of other logging, loads of neocities websites!

What's happening is that the script pops a value off the list of websites yet to visit, then gets all the links on that page which are on the same domain. If those links point to pages we haven't visited, we add them to the list of links to be visited. After that, we pop the next page and do the same thing again until there are no pages left to visit.
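
Stripped of the logging, the core of that loop is just a stack of pages to visit plus a list of pages already seen. A condensed sketch is below; the urljoin call for resolving relative links is an addition not present in the answer's code:

from urllib.parse import urljoin

from bs4 import BeautifulSoup
import requests
import tldextract

to_visit = ["http://Noelkd.neocities.org/"]
visited = []

while to_visit:
    link = to_visit.pop()  # take the next page off the stack
    if link in visited:
        continue
    visited.append(link)
    try:
        page = requests.get(link)
    except requests.exceptions.RequestException:
        continue  # skip unreachable pages
    domain = tldextract.extract(link).domain
    soup = BeautifulSoup(page.text, "html.parser")
    for a in soup.find_all('a'):
        href = a.get('href')
        if not href:
            continue
        href = urljoin(link, href)  # resolve relative links (assumption, not in the answer's code)
        if tldextract.extract(href).domain == domain and href not in visited:
            to_visit.append(href)

print(visited)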

I think this is what you're looking for. Get back to us in the comments if this doesn't work the way you want, or if anyone can improve it, please leave a comment.
