
Tree traversal in Python

I'm trying to write a script in Python to find the non-responsive links of a web page. While trying, I found that Python doesn't seem to support multiple child nodes. Is that true, or can we access the multiple child nodes?

Below is my code snippet:

import requests
from bs4 import BeautifulSoup, SoupStrainer

status = {}
response = {}
output = {}

def get_url_status(url, count):
    global links
    links = []
    print(url)
    print(count)
    if count == 0:
        return output
    else:
        # if url not in output.keys():
        headers = requests.utils.default_headers()
        req = requests.get(url, headers)
        if('200' in str(req)):
            # if url not in output.keys():
            output[url] = '200';
            for link in BeautifulSoup(req.content, parse_only=SoupStrainer('a')):
                if 'href' in str(link):
                    links.append(link.get('href'))

            # removing other non-mandatory links
            for link in links[:]:
                if "mi" not in link:
                    links.remove(link)

            # removing same url
            for link in links[:]:
                if link.rstrip('/') == url:
                    links.remove(link)

            # removing duplicate links
            links = list(dict.fromkeys(links))
            if len(links) > 0:
                for urllink in links:
                    return get_url_status(urllink, count-1)

result = get_url_status('https://www.mi.com/in', 5)
print(result)

In this code, it's only traversing the left-most child nodes and skipping the rest, something like this:

[diagram from the original post illustrating the traversal]

And the output is unsatisfactory; it contains far fewer links than the site really has.

{'https://www.mi.com/in': '200', 'https://in.c.mi.com/': '200', 'https://in.c.mi.com/index.php': '200', 'https://in.c.mi.com/global/': '200', 'https://c.mi.com/index.php': '200'}

I know I'm lacking in multiple places, but I've never done anything of this scale and this is my first time, so please excuse me if this is a novice question.

Note: I've used mi.com just for reference.

At a glance, there's one obvious problem.

if len(links) > 0:
    for urllink in links:
        return get_url_status(urllink, count-1)

This snippet does not iterate over links. It has return in its loop body, which means it runs for the first item in links only and immediately returns. There is another bug: the function returns None instead of output if it encounters a page with no links before count reaches 0. Do the following instead.

if len(links):
    for urllink in links:
        get_url_status(urllink, count-1)
return output
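
To see the difference, here is a minimal, self-contained example; the doubling is just a stand-in for the recursive call:

def first_only(items):
    for item in items:
        return item * 2       # returns on the first iteration; the rest are never visited

def all_items(items):
    results = []
    for item in items:
        results.append(item * 2)  # runs for every item
    return results                # returned only after the loop completes

print(first_only([1, 2, 3]))  # 2
print(all_items([1, 2, 3]))   # [2, 4, 6]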

And if('200' in str(req)) is not the right way to check the status code. str(req) is just the response's repr, something like '<Response [200]>', so you are matching a substring of that string instead of inspecting the code itself. It should be if req.status_code == 200 .
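
A quick way to see the difference (assuming the URL responds with 200; the printed values are examples):

import requests

req = requests.get('https://www.mi.com/in')
print(str(req))         # '<Response [200]>' -- just the repr of the Response object
print(req.status_code)  # 200 -- the actual integer status code
if req.status_code == 200:
    print('responsive')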

Another thing is that the function only adds responsive links to output . If you want to check for non-responsive links, don't you have to add links that do not return the 200 status code?
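
For instance, a minimal sketch of that idea (the check_link helper and the exception handling are my own additions, not part of your code):

import requests

def check_link(url, output):
    # record every checked URL with its status code, responsive or not
    try:
        req = requests.get(url, headers=requests.utils.default_headers())
        output[url] = str(req.status_code)  # '200', '404', '500', ...
    except requests.RequestException as exc:
        output[url] = repr(exc)             # unreachable hosts are non-responsive too

output = {}
check_link('https://www.mi.com/in', output)
print(output)

Putting the earlier fixes together, the full script becomes: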

import requests
from bs4 import BeautifulSoup, SoupStrainer

status = {}
response = {}
output = {}

def get_url_status(url, count):
    global links
    links = []
    # if url not in output.keys():
    headers = requests.utils.default_headers()
    req = requests.get(url, headers=headers)
    if req.status_code == 200:
        # if url not in output.keys():
        output[url] = '200'
        if count == 0:
            return output
        for link in BeautifulSoup(req.content, "html.parser", parse_only=SoupStrainer('a')):
            if 'href' in str(link):
                links.append(link.get('href'))

        # removing other non-mandatory links
        for link in links[:]:
            if "mi" not in link:
                links.remove(link)

        # removing same url
        for link in links[:]:
            if link.rstrip('/') == url:
                links.remove(link)

        # removing duplicate links
        links = list(dict.fromkeys(links))
        print(links)
        if len(links):
            for urllink in links:
                get_url_status(urllink, count-1)
        return output

result = get_url_status('https://www.mi.com/in', 1)
print(result)
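
One more caveat: href values are often relative (for example /index.php), and requests.get will fail on those. Here is a small sketch of resolving them with urllib.parse.urljoin before requesting; the example hrefs are made up:

from urllib.parse import urljoin

base = 'https://www.mi.com/in'
hrefs = ['/index.php', 'https://in.c.mi.com/', 'global/']

# urljoin resolves relative hrefs against the page they came from
# and leaves absolute URLs unchanged
absolute = [urljoin(base, h) for h in hrefs]
print(absolute)
# ['https://www.mi.com/index.php', 'https://in.c.mi.com/', 'https://www.mi.com/global/']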
