
Creating a script in Python that will search for a specific URL on all pages of a website using BeautifulSoup, requests, and possibly Scrapy

I am a new Stack Overflow member, so please let me know if and how I can improve this question. I am working on a Python script which will take a link to a website's home page and then search for a specific URL throughout the entire website (not just that first homepage). The reason for this is that my research team would like to query a list of websites for a URL to a particular database, without having to click through every single page to find it. It is essentially a task of saying "Does this website reference this database? If so, how many times?" and then keeping that information for our records. So far, I have been able to use resources on SO and other pages to create a script that will scrape the HTML of the specific webpage I have referenced, and I have included this script below for review.

import requests  
from bs4 import BeautifulSoup  

url = raw_input("Enter the name of the website you'd like me to check, followed by a space:")

r = requests.get(url)

soup = BeautifulSoup(r.content, features='lxml')

links = soup.find_all("a")
for link in links:
    if "http" and "dataone" in link.get("href"):
        print("<a href='%s'>%s</a>" %(link.get("href"), link.text))

As you can see, I am looking for a URL linking to a particular database (in this case, DataONE) after being given a website URL by the user. This script works great, but it only scrapes the particular page that I link -- NOT the entire website. So, if I provide the website https://www.lib.utk.edu/, it will only search for references to DataONE within that page; it will not search for references across all of the pages under the UTK Libraries website. **I do not have a high enough reputation on this site yet to post pictures, so I am unable to include an image of this script "in action."**

I've heavily researched this on SO to try and gain insight, but none of the questions asked or answered thus far apply to my specific problem.

Examples:
1. How can I loop scraping data for multiple pages in a website using python and beautifulsoup4: in this particular question, the OP can find out how many pages they need to search through, because their problem refers to a specific search made on a site. In my case, however, I will not know how many pages there are in each website.
2. Use BeautifulSoup to loop through and retrieve specific URLs: again, this deals with parsing through URLs, but it does not look through an entire website for them.
3. How to loop through each page of website for web scraping with BeautifulSoup: the OP here seems to be struggling with the same problem I am having, but the accepted answer does not provide enough detail for understanding HOW to approach a problem like this.

I've scoured the BeautifulSoup documentation, but I have not found any help with scraping an entire website from a single URL (without knowing how many total pages the website has). I've looked into using Scrapy, but I'm not sure it's what I need for this project, because I am not trying to download or store data -- I am simply trying to see when and where a certain URL is referenced on an entire website.

My question: Is doing something like this possible with BeautifulSoup, and if so, can you suggest how I should change my current code to handle my research problem? Or is there another program I should look into using?

You could use two Python sets to keep track of pages you have already visited and of pages you still need to visit.

Also: your if condition is wrong. To test for both substrings, you cannot use a and b in c; you need a in c and b in c.
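To see why the original test misbehaves, here is a quick interpreter illustration (the href value is made up):

href = "/dataone-page"
# "in" binds tighter than "and", so the broken test is evaluated as
# "http" and ("dataone" in href); the non-empty string "http" is always
# truthy, so only the second check ever matters:
print("http" and "dataone" in href)          # True -- a false positive
print("http" in href and "dataone" in href)  # False -- what was intended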

Something like this:

import requests
from bs4 import BeautifulSoup


baseurl = 'https://example.org'
urls_to_check = {baseurl}   # pages still to visit
checked_urls = set()        # pages already visited

found_links = []
while urls_to_check:
    url = urls_to_check.pop()
    r = requests.get(url)

    soup = BeautifulSoup(r.content, features='lxml')

    links = soup.find_all("a")
    for link in links:
        # link.get("href") returns None for anchors without an href,
        # so default to an empty string before testing it
        href = link.get("href", "")
        if "http" in href and "dataone" in href:
            found_links.append("<a href='%s'>%s</a>" % (href, link.text))
        elif href.startswith("/"):
            # site-relative link: queue it if we haven't visited it yet
            if baseurl + href not in checked_urls:
                urls_to_check.add(baseurl + href)
    checked_urls.add(url)
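One caveat with this version: it only queues site-relative links that begin with "/". Absolute links back to the same domain are never added to urls_to_check, and relative links without a leading slash are skipped entirely. urljoin from urllib.parse can normalize all of these forms, as the crawler sketch further down shows.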

You will need to implement some form of crawler.

This can be done manually; essentially, you'd do this (a sketch follows the list):

  1. check whether a robots.txt exists and parse it for URLs, adding them to a list of pages to visit later
  2. parse whichever page you visit first for further links; you will probably search for all <a> elements and parse out their href, then figure out whether the link points to the same site, e.g. href="/info.html", but also href="http://lib.edu.org/info.html"
  3. add the identified URLs to a list of URLs to visit
  4. repeat from step 2 until all URLs have been visited
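A minimal sketch of those four steps, using requests and BeautifulSoup as in the question (the "dataone" needle and the helper name crawl_for_links are just illustrative, and the code does no throttling or error logging):

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin, urlparse
from urllib.robotparser import RobotFileParser

def crawl_for_links(start_url, needle="dataone"):
    # Step 1: respect robots.txt if the site has one
    # (RobotFileParser allows everything when the file 404s).
    robots = RobotFileParser(urljoin(start_url, "/robots.txt"))
    robots.read()

    site = urlparse(start_url).netloc
    to_visit, seen, matches = {start_url}, set(), []
    while to_visit:                    # Step 4: loop until nothing is left
        url = to_visit.pop()
        seen.add(url)
        if not robots.can_fetch("*", url):
            continue
        try:
            r = requests.get(url, timeout=10)
        except requests.RequestException:
            continue                   # skip pages that fail to load
        soup = BeautifulSoup(r.content, features="lxml")
        # Step 2: urljoin normalizes both "/info.html" and
        # "http://lib.edu.org/info.html" against the current page.
        for a in soup.find_all("a", href=True):
            link = urljoin(url, a["href"]).split("#", 1)[0]  # drop fragments
            if needle in link.lower():
                matches.append((url, link))
            # Step 3: only queue links that stay on the same site.
            elif urlparse(link).netloc == site and link not in seen:
                to_visit.add(link)
    return matches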

I'd recommend looking into Scrapy, though. It lets you define Spiders that you feed with information about which URLs to start at and how to generate further links to visit. A Spider has a parse method that you can utilize to search for your database. In case of a match, you could update a local SQLite DB or simply write a count to a text file.
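For reference, a minimal Scrapy sketch of that idea might look like the following (the spider name, start URL, and matches.json output file are all placeholders, and the FEEDS setting requires Scrapy 2.1+):

import scrapy
from scrapy.crawler import CrawlerProcess

class DataoneSpider(scrapy.Spider):
    name = "dataone"
    start_urls = ["https://www.lib.utk.edu/"]   # the site to search
    allowed_domains = ["lib.utk.edu"]           # keeps the crawl on-site

    def parse(self, response):
        for href in response.css("a::attr(href)").getall():
            if "dataone" in href.lower():
                # record where the database is referenced
                yield {"page": response.url, "link": href}
            else:
                # Scrapy's built-in dupefilter skips already-seen URLs
                yield response.follow(href, callback=self.parse)

process = CrawlerProcess(settings={"FEEDS": {"matches.json": {"format": "json"}}})
process.crawl(DataoneSpider)
process.start()

Scrapy takes care of the visited-URL bookkeeping, politeness delays, and parallel requests for you, which is exactly the housekeeping the manual approach above has to do by hand.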

TL;DR: from visiting a single page, it is hard to identify what other pages exist. You have to parse all internal links. A robots.txt can be helpful in this effort, but it is not guaranteed to exist.
