I am writing a Python script using BeautifulSoup. I need to scrape a website and count the unique links per domain, ignoring links that start with '#'.
For example, suppose the following links exist on a webpage:
https://www.stackoverflow.com/questions
https://www.stackoverflow.com/foo
For this example, there are only two unique links (everything after the main domain name is removed):
https://stackoverflow.com/ Count 2
https://cnn.com/ Count 1
Note: this is my first time using Python and web scraping tools.
I appreciate all the help in advance.
This is what I have tried so far:
from bs4 import BeautifulSoup
import requests
url = 'https://en.wikipedia.org/wiki/Beautiful_Soup_(HTML_parser)'
r = requests.get(url)
soup = BeautifulSoup(r.text, "html.parser")
count = 0
for link in soup.find_all('a'):
    print(link.get('href'))
    count += 1
The urlparse function from urllib.parse gives you the netloc (the host part) of a URL. There is also an HTTP library named requests_html that can collect all the links in a page's source.
from requests_html import HTMLSession
from collections import Counter
from urllib.parse import urlparse
session = HTMLSession()
r = session.get("the link you want to crawl")
unique_netlocs = Counter(urlparse(link).netloc for link in r.html.absolute_links)
for link in unique_netlocs:
    print(link, unique_netlocs[link])
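To see what urlparse and Counter do without fetching a page, here is a minimal sketch on a hard-coded list. The sample URLs and the helper name domain_of are made up for illustration, and stripping a leading "www." is an assumption based on the expected output in the question.

```python
from collections import Counter
from urllib.parse import urlparse

# Hypothetical sample links; in practice these would come from the scraped page.
links = [
    "https://www.stackoverflow.com/questions",
    "https://www.stackoverflow.com/foo",
    "https://www.cnn.com/world",
]


def domain_of(link):
    """Return the lowercase host of a URL without a leading 'www.' (illustrative helper)."""
    host = urlparse(link).hostname or ""
    return host[4:] if host.startswith("www.") else host


# Count how many links point at each host.
counts = Counter(domain_of(link) for link in links)
for domain, n in counts.items():
    print(domain, n)
```

Note that this only strips "www."; other subdomains such as "blog." are kept as part of the host.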
You could also do this:
from bs4 import BeautifulSoup
from collections import Counter
import requests
soup = BeautifulSoup(requests.get("https://en.wikipedia.org/wiki/Beautiful_Soup_(HTML_parser)").text, "html.parser")
foundUrls = Counter([link["href"] for link in soup.find_all("a", href=lambda href: href and not href.startswith("#"))])
foundUrls = foundUrls.most_common()
for item in foundUrls:
    print("%s: %d" % (item[0], item[1]))
The soup.find_all line keeps only the a tags that have an href set and whose href does not start with the # character. Counter counts the occurrences of each list entry, and most_common orders the entries by count, highest first. The for loop just prints the results.
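The code above counts full href values. If a count per domain is wanted instead, the same '#' filter can be combined with urljoin and urlparse. The sketch below swaps Beautiful Soup for the standard library's html.parser so that it runs without extra packages; the class name LinkCollector, the sample HTML, and the base URL are illustrative assumptions, not part of the answer.

```python
from collections import Counter
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collect href values of <a> tags, skipping missing hrefs and '#' links."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href and not href.startswith("#"):
            self.hrefs.append(href)


# Made-up page content and base URL, only for demonstration.
html = """
<a href="#top">top</a>
<a href="/wiki/Python">relative</a>
<a href="https://example.com/a">external</a>
<a href="https://example.com/b">external</a>
<a>no href</a>
"""
base_url = "https://en.wikipedia.org/wiki/Beautiful_Soup_(HTML_parser)"

collector = LinkCollector()
collector.feed(html)

# Resolve relative links against the page URL, then keep only the host part.
domains = Counter(urlparse(urljoin(base_url, h)).netloc for h in collector.hrefs)
for domain, n in domains.most_common():
    print("%s: %d" % (domain, n))
```

Resolving with urljoin first means relative links such as /wiki/Python are counted under the page's own host instead of an empty one.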
My way to do this is to find all links using Beautiful Soup and then determine which domain each link points to:
import requests
import tldextract
from bs4 import BeautifulSoup


def get_count_url(url):  # get the number of links having the same domain and suffix
    r = requests.get(url)
    soup = BeautifulSoup(r.text, "html.parser")
    count = 0
    urls = {}  # dictionary mapping each domain to its link count
    # tldextract extracts the exact domain (e.g. blog.bbc.com and bbc.com have the same domain)
    ext = tldextract.extract(url)
    input_domain = ext.domain + "." + ext.suffix
    for link in soup.find_all('a'):
        word = link.get('href')
        if word:
            if word.startswith("#") or word.startswith("/"):
                # in-page anchor or root-relative link: same domain as the input URL
                urls[input_domain] = urls.get(input_domain, 0) + 1
            elif word.startswith("javascript:"):
                # javascript function calls (for sites that use JS frameworks to display information)
                urls["JavascriptRenderingFunctionCall"] = urls.get("JavascriptRenderingFunctionCall", 0) + 1
            else:
                # other link: tldextract already drops subdomains such as www
                ext = tldextract.extract(word)
                main_domain = ext.domain + "." + ext.suffix
                urls[main_domain] = urls.get(main_domain, 0) + 1
            count += 1
    for key, value in urls.items():  # print one domain and its count per line
        print(key, value)
    return count
tldextract extracts the registered domain from each URL, and soup.find_all('a') finds the a tags. The if statements distinguish same-domain links, javascript links, and links to other domains.
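The three-way check can also be sketched as a small pure function that is easy to test without a network call. This version is an assumption-laden alternative: it uses only urllib.parse, so it compares full host names and does not collapse subdomains the way tldextract does, and classify_href is a hypothetical name.

```python
from urllib.parse import urljoin, urlparse


def classify_href(page_url, href):
    """Return 'javascript', 'same-site', 'other', or the host of an external link."""
    if href.lower().startswith("javascript:"):
        return "javascript"
    page_host = urlparse(page_url).hostname
    # urljoin resolves '#frag', '/path' and 'relative/path' against the page URL.
    link_host = urlparse(urljoin(page_url, href)).hostname
    if link_host is None:
        return "other"  # e.g. mailto: links, which have no host
    return "same-site" if link_host == page_host else link_host


page = "https://bbc.com/news"  # made-up page URL for the demonstration
for href in ["#top", "/sport", "javascript:void(0)", "https://cnn.com/world"]:
    print(href, "->", classify_href(page, href))
```

Letting urljoin resolve each href avoids special-casing links by their first character, and it also handles relative paths that do not start with "/".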