What is the best way to scrape and plot connected pages on a website with Python?

Question

I've been working on a project that takes a URL as input and creates a map of the page connections on a website.

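My approach is to scrape a page for its links, then create a page object that holds the page's href and a list of all the child links on that page. Once I have pulled the data from every page on the site, I would pass it to a graphing library such as matplotlib or plotly to get a graphical representation of the connections between pages on the website.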

This is my code so far:

from urllib.request import urlopen
import urllib.error
from bs4 import BeautifulSoup, SoupStrainer

#object to hold page href and child links on page
class Page:

    def __init__(self, href, links):
        self.href = href
        self.children = links

    def getHref(self):
        return self.href

    def getChildren(self):
        return self.children


#method to get an array of all hrefs on a page
def getPages(url):
    allLinks = []

    try:
        #combine the starting url and the new href
        page = urlopen('{}{}'.format(startPage, url))
        for link in BeautifulSoup(page, 'html.parser', parse_only=SoupStrainer('a')):
            try:
                if 'href' in link.attrs:
                    allLinks.append(link)
            except AttributeError:
                #if there is no href, skip the link
                continue
            
        #return an array of all the links on the page
        return allLinks

    #catch pages that can't be opened
    except urllib.error.HTTPError:
        print('Could not open {}{}'.format(startPage, url))
    

#get starting page url from user
startPage = input('Enter a URL: ')
page = urlopen(startPage)

#sets to hold unique hrefs and page objects
pages = set()
pageObj = set()

for link in BeautifulSoup(page, 'html.parser', parse_only=SoupStrainer('a')):
    try:
        if 'href' in link.attrs:
            if link.attrs['href'] not in pages:
                newPage = link.attrs['href']
                pages.add(newPage)

                #get the child links on this page
                pageChildren = getPages(newPage)

                #create a new page object, add to set of page objects
                pageObj.add(Page(newPage, pageChildren))
    except AttributeError:
        print('{} has an attribute error.'.format(link))
        continue

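Would Scrapy be better for what I'm trying to do?
Which library would work best for displaying the connections?
How do I fix the getPages function to correctly combine the user-entered URL with the hrefs pulled from the page? If I am at 'https://en.wikipedia.org/wiki/Main_Page', I get 'Could not open https://en.wikipedia.org/wiki/Main_Page/wiki/English_language'. I think I need to combine them from the end of the .org/ and drop the /wiki/Main_Page, but I don't know the best way to do this.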

This is my first real project, so any pointers on how to improve my logic are appreciated.

Answer 1

That's a nice idea for a first project!

Would Scrapy be better for what I'm trying to do?

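A Scrapy version of your project would have a number of advantages over your current version. The advantage you would notice immediately is the speed at which requests are made. However, it may take you a while to get used to the structure of a Scrapy project.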

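How do I fix the getPages function to correctly combine the user-entered URL with the hrefs pulled from the page? If I am at 'https://en.wikipedia.org/wiki/Main_Page', I get 'Could not open https://en.wikipedia.org/wiki/Main_Page/wiki/English_language'. I think I need to combine them from the end of the .org/ and drop the /wiki/Main_Page, but I don't know the best way to do this.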

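You can achieve this with urllib.parse.urljoin(startPage, relativeHref). Most of the links you will find are relative links, which the urljoin function can convert to absolute links.
In your code, change newPage = link.attrs['href'] to newPage = urllib.parse.urljoin(startPage, link.attrs['href']), and change page = urlopen('{}{}'.format(startPage, url)) to page = urlopen(url). Since url is then already absolute, the error message in getPages should print url on its own as well.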
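
As a quick illustration of the point above, here is a minimal sketch of how urljoin resolves the href from the question against the start page. The variable names start_page and relative_href are illustrative only and are not part of the original code.

```python
from urllib.parse import urljoin

# URLs taken from the question; the variable names are illustrative.
start_page = 'https://en.wikipedia.org/wiki/Main_Page'
relative_href = '/wiki/English_language'

# A root-relative href replaces the whole path of the base URL,
# so '/wiki/Main_Page' is dropped instead of being prepended.
absolute_url = urljoin(start_page, relative_href)
print(absolute_url)

# An href that is already absolute is returned unchanged.
external_url = urljoin(start_page, 'https://www.python.org/')
print(external_url)
```

Because urljoin leaves absolute hrefs alone, it should be safe to apply to every link you extract, whether it is relative or not.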

Here are a few examples of where you could change your code slightly to get some benefit.

Instead of for link in BeautifulSoup(page, 'html.parser', parse_only=SoupStrainer('a')): you can use for link in BeautifulSoup(page, 'html.parser').find_all('a', href=True):. This way, every link is already guaranteed to have an href.
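
To see what href=True does, here is a small sketch run against a made-up HTML snippet. The snippet and the variable names are assumptions for illustration; in the real script the markup would come from urlopen.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML used only for this illustration.
html = '''
<a href="/wiki/English_language">English</a>
<a name="top">anchor without an href</a>
<a href="/wiki/Python">Python</a>
'''

soup = BeautifulSoup(html, 'html.parser')

# href=True keeps only <a> tags that actually carry an href attribute,
# so no 'href' in link.attrs check or try/except is needed afterwards.
hrefs = [link['href'] for link in soup.find_all('a', href=True)]
print(hrefs)
```

The anchor without an href is skipped, so the loop body can read link['href'] directly.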

To prevent links on the same page from appearing twice, you should change allLinks = [] to a set.
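
A minimal sketch of that change, assuming you store the resolved href strings rather than the tag objects. The helper name unique_links is hypothetical and does not appear in the original code.

```python
from urllib.parse import urljoin


def unique_links(base_url, hrefs):
    """Return the set of absolute URLs for the given hrefs."""
    all_links = set()
    for href in hrefs:
        # add() ignores values that are already in the set.
        all_links.add(urljoin(base_url, href))
    return all_links


links = unique_links(
    'https://en.wikipedia.org/wiki/Main_Page',
    ['/wiki/English_language', '/wiki/English_language', '/wiki/Python'],
)
print(len(links))
```

Note that a set uses add rather than append and does not preserve order, which should not matter for building a graph of page connections.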

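This one comes down to preference, but since Python 3.6 there is another syntax, called "f-strings", for referencing variables in strings. For example, you could change print('{} has an attribute error.'.format(link)) to print(f'{link} has an attribute error.').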
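
For comparison, a short sketch showing that both styles produce the same string. The value of link is made up for the example.

```python
link = '/wiki/English_language'  # illustrative value only

# str.format style, as used in the question.
old_style = '{} has an attribute error.'.format(link)

# f-string style, available from Python 3.6 onwards.
new_style = f'{link} has an attribute error.'

print(old_style == new_style)
```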

Solution 1
0 2020-06-22 20:04:16
