
Extract count of specific links from a web page.

I am writing a Python script using BeautifulSoup. I need to scrape a website and count the unique links, ignoring links that start with '#'.

For example, if the following links exist on a webpage:

https://www.stackoverflow.com/questions

https://www.stackoverflow.com/foo

https://www.cnn.com/

For this example, there are only two unique links (the part of the link after the domain name is removed):

https://stackoverflow.com/    Count 2
https://cnn.com/              Count 1

Note: this is my first time using Python and web scraping tools.

I appreciate all the help in advance.

This is what I have tried so far:

from bs4 import BeautifulSoup
import requests


url = 'https://en.wikipedia.org/wiki/Beautiful_Soup_(HTML_parser)'

r = requests.get(url)

soup = BeautifulSoup(r.text, "html.parser")


count = 0

for link in soup.find_all('a'):
    print(link.get('href'))
    count += 1

There is a function named urlparse in urllib.parse that gives you the netloc (domain) of a URL. There is also an HTTP library named requests_html that can collect all the absolute links in a page for you.

from requests_html import HTMLSession
from collections import Counter
from urllib.parse import urlparse

session = HTMLSession()
r = session.get("the link you want to crawl")
unique_netlocs = Counter(urlparse(link).netloc for link in r.html.absolute_links)
for link in unique_netlocs:
    print(link, unique_netlocs[link])
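
For reference, netloc is the host part of a URL, so counting by netloc groups the links by domain. Here is a minimal sketch of what urlparse returns for one of the links from the question:

from urllib.parse import urlparse

# urlparse splits a URL into components; netloc is the host (domain) part
parsed = urlparse("https://www.stackoverflow.com/questions")
print(parsed.netloc)  # www.stackoverflow.com
print(parsed.path)    # /questions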

You could also do this:

from bs4 import BeautifulSoup
from collections import Counter
import requests

soup = BeautifulSoup(requests.get("https://en.wikipedia.org/wiki/Beautiful_Soup_(HTML_parser)").text, "html.parser")

foundUrls = Counter([link["href"] for link in soup.find_all("a", href=lambda href: href and not href.startswith("#"))])
foundUrls = foundUrls.most_common()

for item in foundUrls:
    print ("%s: %d" % (item[0], item[1]))

The soup.find_all line checks that each a tag has an href set and that the href does not start with the # character. Counter counts the occurrences of each list entry, and most_common orders the entries by count.

The for loop just prints the results.
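
To make the counting step concrete, here is a minimal, self-contained sketch (with made-up href values) of what Counter and most_common do on their own:

from collections import Counter

# Counter tallies how many times each value appears in the list
hrefs = ["https://stackoverflow.com/", "https://stackoverflow.com/", "https://cnn.com/"]
counts = Counter(hrefs)
print(counts.most_common())  # [('https://stackoverflow.com/', 2), ('https://cnn.com/', 1)]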

My way to do this is to find all links using Beautiful Soup and then determine which link redirects to which location:

import requests
import tldextract
from bs4 import BeautifulSoup

def get_count_url(url):  # get the number of links having the same domain and suffix
    r = requests.get(url)
    soup = BeautifulSoup(r.text, "html.parser")
    count = 0
    urls = {}  # dictionary keyed by domain
    # tldextract extracts the registered domain (e.g. blog.bbc.com and bbc.com share the same domain)
    input_domain = tldextract.extract(url).domain + "." + tldextract.extract(url).suffix
    for link in soup.find_all('a'):
        word = link.get('href')
        if word:
            if "#" in word or word[0] == "/":  # fragment link or same-domain (relative) call
                if input_domain not in urls:
                    urls[input_domain] = 1  # first encounter with the domain
                else:
                    urls[input_domain] += 1  # subsequent encounters
            elif "javascript" in word:
                # javascript function calls (for sites that use JS frameworks to display information)
                if "JavascriptRenderingFunctionCall" not in urls:
                    urls["JavascriptRenderingFunctionCall"] = 1
                else:
                    urls["JavascriptRenderingFunctionCall"] += 1
            else:
                main_domain = tldextract.extract(word).domain + "." + tldextract.extract(word).suffix
                if main_domain.split('.')[0] == 'www':
                    main_domain = main_domain.replace("www.", "")  # remove the www prefix
                if main_domain not in urls:  # maintain the dictionary
                    urls[main_domain] = 1
                else:
                    urls[main_domain] += 1
            count += 1

    for key, value in urls.items():  # print each domain with its count
        print(key, value)
    return count

tldextract extracts the registered domain from a URL, and soup.find_all('a') finds the a tags. The if statements check for same-domain links, javascript pseudo-links, and links to other domains.
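
As a quick illustration of why tldextract is used instead of plain string splitting (a minimal sketch; the URL is only an example):

import tldextract

# tldextract separates the subdomain from the registered domain and the public suffix
ext = tldextract.extract("https://blog.bbc.co.uk/news")
print(ext.subdomain)                  # blog
print(ext.domain)                     # bbc
print(ext.suffix)                     # co.uk
print(ext.domain + "." + ext.suffix)  # bbc.co.uk

Calling get_count_url(url) on a page then prints one line per domain with its count and returns the total number of links found.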
