
What's the best way to scrape and plot connected pages on a website with Python?

I've been working on a project that takes a URL as input and creates a map of the page connections on a website.

My approach is to scrape a page for links, then create a page object holding that page's href and a list of all the child links on the page. Once I have the data pulled from all the pages on the site, I would pass it to a graphing library like matplotlib or plotly to get a graphical representation of the connections between pages on the website.

This is my code so far:

from urllib.request import urlopen
import urllib.error
from bs4 import BeautifulSoup, SoupStrainer

#object to hold page href and child links on page
class Page:

    def __init__(self, href, links):
        self.href = href
        self.children = links

    def getHref(self):
        return self.href

    def getChildren(self):
        return self.children


#method to get an array of all hrefs on a page
def getPages(url):
    allLinks = []

    try:
        #combine the starting url and the new href
        page = urlopen('{}{}'.format(startPage, url))
        for link in BeautifulSoup(page, 'html.parser', parse_only=SoupStrainer('a')):
            try:
                if 'href' in link.attrs:
                    allLinks.append(link)
            except AttributeError:
                #if there is no href, skip the link
                continue
            
        #return an array of all the links on the page
        return allLinks

    #catch pages that can't be opened
    except urllib.error.HTTPError:
        print('Could not open {}{}'.format(startPage, url))
    

#get starting page url from user
startPage = input('Enter a URL: ')
page = urlopen(startPage)

#sets to hold unique hrefs and page objects
pages = set()
pageObj = set()

for link in BeautifulSoup(page, 'html.parser', parse_only=SoupStrainer('a')):
    try:
        if 'href' in link.attrs:
            if link.attrs['href'] not in pages:
                newPage = link.attrs['href']
                pages.add(newPage)

                #get the child links on this page
                pageChildren = getPages(newPage)

                #create a new page object, add to set of page objects
                pageObj.add(Page(newPage, pageChildren))
    except AttributeError:
        print('{} has an attribute error.'.format(link))
        continue

  • Would Scrapy be better for what I'm trying to do?
  • What library would work best for displaying the connections?
  • How do I fix the getPages function to correctly combine the user-supplied URL with the hrefs pulled from the page? If I'm at 'https://en.wikipedia.org/wiki/Main_Page', I get 'Could not open https://en.wikipedia.org/wiki/Main_Page/wiki/English_language'. I think I need to join from the end of the .org/ and drop the /wiki/Main_Page, but I don't know the best way to do this.

This is my first real project, so any pointers on how I could improve my logic are appreciated.

That's a nice idea for a first project!

Would Scrapy be better for what I'm trying to do?

A Scrapy version of your project would have numerous advantages over the current one. The one you would feel immediately is speed: Scrapy issues requests concurrently instead of one at a time. However, it may take you a while to get used to the structure of Scrapy projects.
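To give a feel for the difference, here is a minimal sketch of what such a spider could look like. The spider name, the start_url argument, and the follow-everything behaviour are my choices for illustration, not something your project prescribes; a real crawler would also set allowed_domains so it doesn't wander off-site.

import scrapy

class LinkMapSpider(scrapy.Spider):
    #hypothetical name, used to invoke the spider
    name = 'linkmap'

    def __init__(self, start_url=None, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.start_urls = [start_url] if start_url else []

    def parse(self, response):
        #resolve every href on the page to an absolute URL
        children = [response.urljoin(href)
                    for href in response.css('a::attr(href)').getall()]
        #one record per page: its URL plus its child links
        yield {'href': response.url, 'children': children}
        #follow each child; Scrapy's scheduler skips URLs it has
        #already requested, so cycles are handled for free
        for child in children:
            yield response.follow(child, callback=self.parse)

You could run it with scrapy runspider linkmap_spider.py -a start_url='https://en.wikipedia.org/wiki/Main_Page' -o pages.json and get one page/children record per crawled page as JSON.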

How do I fix the getPages function to correctly combine the user-supplied URL with the hrefs pulled from the page? If I'm at 'https://en.wikipedia.org/wiki/Main_Page', I get 'Could not open https://en.wikipedia.org/wiki/Main_Page/wiki/English_language'. I think I need to join from the end of the .org/ and drop the /wiki/Main_Page, but I don't know the best way to do this.

You can achieve this with urllib.parse.urljoin(startPage, relativeHref). Most of the links you'll find are relative links, which you can convert to absolute links with the urljoin function.
In your code you would change newPage = link.attrs['href'] to newPage = urllib.parse.urljoin(startPage, link.attrs['href']) and page = urlopen('{}{}'.format(startPage, url)) to page = urlopen(url).
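For example, with the Wikipedia URL from your question:

from urllib.parse import urljoin

start = 'https://en.wikipedia.org/wiki/Main_Page'

#a root-relative href replaces the path of the start URL
print(urljoin(start, '/wiki/English_language'))
#https://en.wikipedia.org/wiki/English_language

#an href that is already absolute is returned unchanged
print(urljoin(start, 'https://example.com/page'))
#https://example.com/page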

Here are a few places where you can change your code slightly for some benefit.

Instead of for link in BeautifulSoup(page, 'html.parser', parse_only=SoupStrainer('a')): you can use BeautifulSoup's find_all() function, like this: for link in BeautifulSoup(page, 'html.parser').find_all('a', href=True):. That way every link you iterate over is guaranteed to have an href.
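As a standalone illustration, fetching the Wikipedia page from your question:

from urllib.request import urlopen
from bs4 import BeautifulSoup

page = urlopen('https://en.wikipedia.org/wiki/Main_Page')
soup = BeautifulSoup(page, 'html.parser')

#href=True filters out <a> tags without an href attribute,
#so no 'href' in link.attrs check is needed
for link in soup.find_all('a', href=True):
    print(link['href'])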

To prevent links on the same page from occurring twice, you should change allLinks = [] to a set instead.

This is up to preference, but since Python 3.6 there is another syntax, called f-strings, for referencing variables in strings. For example, you could change print('{} has an attribute error.'.format(link)) to print(f'{link} has an attribute error.').
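Putting the urljoin fix and the suggestions above together, getPages could look roughly like this. It is a sketch, not a drop-in replacement: it takes an already absolute URL, and it returns an empty set instead of None when a page can't be opened, so callers don't need to special-case failures.

from urllib.error import HTTPError
from urllib.parse import urljoin
from urllib.request import urlopen
from bs4 import BeautifulSoup

def getPages(url):
    #a set stores each child href at most once
    allLinks = set()
    try:
        page = urlopen(url)
    except HTTPError:
        #f-string instead of str.format()
        print(f'Could not open {url}')
        return allLinks
    #href=True skips <a> tags that have no href
    for link in BeautifulSoup(page, 'html.parser').find_all('a', href=True):
        #resolve relative hrefs against the page's own URL
        allLinks.add(urljoin(url, link['href']))
    return allLinks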
