
Extract count of specific links from a web page.

I am writing a Python script using BeautifulSoup. I need to scrape a website and count the unique links, ignoring links that start with '#'.

For example, if the following links exist on a webpage:

https://www.stackoverflow.com/questions

https://www.stackoverflow.com/foo

https://www.cnn.com/

For this example, there are only two unique links (the part of the link after the domain name is removed):

https://stackoverflow.com/    Count 2
https://cnn.com/              Count 1

Note: this is my first time using Python and web scraping tools.

I appreciate all the help in advance.

This is what I have tried so far:

from bs4 import BeautifulSoup
import requests


url = 'https://en.wikipedia.org/wiki/Beautiful_Soup_(HTML_parser)'

r = requests.get(url)

soup = BeautifulSoup(r.text, "html.parser")


count = 0

for link in soup.find_all('a'):
    print(link.get('href'))
    count += 1

There is a function named urlparse in urllib.parse that gives you the netloc (domain) of a URL. There is also an HTTP library named requests_html that can collect all the absolute links in a page for you.

from requests_html import HTMLSession
from collections import Counter
from urllib.parse import urlparse

session = HTMLSession()
r = session.get("the link you want to crawl")
unique_netlocs = Counter(urlparse(link).netloc for link in r.html.absolute_links)
for link in unique_netlocs:
    print(link, unique_netlocs[link])
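
For reference, netloc is the host part of a URL, so counting by netloc groups the links by domain. Here is a minimal sketch of what urlparse returns for one of the links from the question:

from urllib.parse import urlparse

# urlparse splits a URL into components; netloc is the host (domain) part
parsed = urlparse("https://www.stackoverflow.com/questions")
print(parsed.netloc)  # www.stackoverflow.com
print(parsed.path)    # /questions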

You could also do this:

from bs4 import BeautifulSoup
from collections import Counter
import requests

soup = BeautifulSoup(requests.get("https://en.wikipedia.org/wiki/Beautiful_Soup_(HTML_parser)").text, "html.parser")

foundUrls = Counter([link["href"] for link in soup.find_all("a", href=lambda href: href and not href.startswith("#"))])
foundUrls = foundUrls.most_common()

for item in foundUrls:
    print ("%s: %d" % (item[0], item[1]))

The soup.find_all line checks that each a tag has an href set and that the href does not start with the # character. Counter counts the occurrences of each list entry, and most_common orders the entries by count.

The for loop just prints the results.
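
To make the counting step concrete, here is a minimal, self-contained sketch (with made-up href values) of what Counter and most_common do on their own:

from collections import Counter

# Counter tallies how many times each value appears in the list
hrefs = ["https://stackoverflow.com/", "https://stackoverflow.com/", "https://cnn.com/"]
counts = Counter(hrefs)
print(counts.most_common())  # [('https://stackoverflow.com/', 2), ('https://cnn.com/', 1)]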

My way to do this is to find all links using Beautiful Soup and then determine which link redirects to which location:

import requests
import tldextract
from bs4 import BeautifulSoup

def get_count_url(url):  # get the number of links having the same domain and suffix
    r = requests.get(url)
    soup = BeautifulSoup(r.text, "html.parser")
    count = 0
    urls = {}  # dictionary keyed by domain
    # tldextract extracts the registered domain (e.g. blog.bbc.com and bbc.com share the same domain)
    input_domain = tldextract.extract(url).domain + "." + tldextract.extract(url).suffix
    for link in soup.find_all('a'):
        word = link.get('href')
        if word:
            if "#" in word or word[0] == "/":  # fragment link or same-domain (relative) call
                if input_domain not in urls:
                    urls[input_domain] = 1  # first encounter with the domain
                else:
                    urls[input_domain] += 1  # subsequent encounters
            elif "javascript" in word:
                # javascript function calls (for sites that use JS frameworks to display information)
                if "JavascriptRenderingFunctionCall" not in urls:
                    urls["JavascriptRenderingFunctionCall"] = 1
                else:
                    urls["JavascriptRenderingFunctionCall"] += 1
            else:
                main_domain = tldextract.extract(word).domain + "." + tldextract.extract(word).suffix
                if main_domain.split('.')[0] == 'www':
                    main_domain = main_domain.replace("www.", "")  # remove the www prefix
                if main_domain not in urls:  # maintain the dictionary
                    urls[main_domain] = 1
                else:
                    urls[main_domain] += 1
            count += 1

    for key, value in urls.items():  # print each domain with its count
        print(key, value)
    return count

tldextract extracts the registered domain from a URL, and soup.find_all('a') finds the a tags. The if statements check for same-domain links, javascript pseudo-links, and links to other domains.
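
As a quick illustration of why tldextract is used instead of plain string splitting (a minimal sketch; the URL is only an example):

import tldextract

# tldextract separates the subdomain from the registered domain and the public suffix
ext = tldextract.extract("https://blog.bbc.co.uk/news")
print(ext.subdomain)                  # blog
print(ext.domain)                     # bbc
print(ext.suffix)                     # co.uk
print(ext.domain + "." + ext.suffix)  # bbc.co.uk

Calling get_count_url(url) on a page then prints one line per domain with its count and returns the total number of links found.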
