Recursively finding all links with bs4 and Python

Question

I'm using the code below to recursively collect all links from a given website. The only problem is that I get this at the beginning of the output file:

https://www.metatrader4.com/en/trading-platform/help/beginning/autotrading/en

https://www.metatrader4.com/en/trading-platform/help/beginning/autotrading/en/en

https://www.metatrader4.com/en/trading-platform/help/beginning/autotrading/en/en/en

...

https://www.metatrader4.com/en/trading-platform/help/beginning/autotrading/en/en/en/en/en/en/en/en/en/en/en/en

etc.

How can I prevent or remove this?

The code:

from bs4 import BeautifulSoup
import requests

# lists 
urls=[] 
   
# function created 
def scrape(site): 
       
    # getting the request from url 
    r = requests.get(site) 
       
    # converting the text 
    s = BeautifulSoup(r.text,"html.parser") 
       
    for i in s.find_all("a"):
          
        href = i.attrs['href'] 
           
        if href.startswith("/"): 
            site = site+href
               
            if site not in  urls: 
                urls.append(site)  
                print(site) 
                # calling it self 
                scrape(site) 
   
# main function 
if __name__ =="__main__": 
   
    # website to be scrape 
    site="https://www.metatrader4.com/en/trading-platform/help/beginning/autotrading"
   
    # calling function 
    scrape(site)

Answer 1

What happens?

You are overwriting site with site + href, and the first href it finds is just /en. Each recursive call then receives the already-extended URL and appends /en to it again.

How to fix?
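The growth can be reproduced without any network access. The sketch below is a simplified stand-in for the question's scrape(): it assumes every page returns the same hypothetical href list, and the depth cap is artificial, added only so the demo stops.

```python
# Network-free reproduction of the bug. The href list and the depth cap are
# assumptions for illustration; only "/en" is taken from the answer above.
def buggy_scrape(site, hrefs, urls, depth):
    if depth == 0:
        return
    for href in hrefs:
        if href.startswith("/"):
            # Same mistake as in the question: the page URL itself is extended,
            # so the recursive call below starts from the longer URL.
            site = site + href
            if site not in urls:
                urls.append(site)
                buggy_scrape(site, hrefs, urls, depth - 1)


start = "https://www.metatrader4.com/en/trading-platform/help/beginning/autotrading"
found = []
buggy_scrape(start, ["/en"], found, depth=3)
for url in found:
    print(url)
```

Each level of recursion adds one more /en, which matches the pattern at the top of the output file.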

Instead of site, use a fixed baseUrl. Leave off the trailing slash, because every matched href already starts with /:

baseUrl = "https://www.metatrader4.com"
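As an alternative to concatenating strings by hand, the standard library's urllib.parse.urljoin resolves an href against the URL of the page it was found on. This is a small sketch rather than part of the original answer; the href values are illustrative.

```python
from urllib.parse import urljoin

# URL of the page the links were found on (taken from the question)
page = "https://www.metatrader4.com/en/trading-platform/help/beginning/autotrading"

# A root-relative href replaces the whole path of the page URL
root_relative = urljoin(page, "/en/download")

# An href that is already absolute is returned unchanged
absolute = urljoin(page, "https://www.metatrader4.com/en")

print(root_relative)
print(absolute)
```

With urljoin there is no need to track a separate base URL or to worry about doubled slashes.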

I'm not sure why you call scrape() over and over again. I commented that call out in my example because it is not necessary.

Example

from bs4 import BeautifulSoup
import requests

# collected URLs, in the order they were found
urls = []


def scrape(site, baseUrl):
    # download the page
    r = requests.get(site)

    # parse the HTML
    s = BeautifulSoup(r.text, "html.parser")

    # href=True skips <a> tags that have no href attribute
    for i in s.find_all("a", href=True):
        href = i["href"]

        if href.startswith("/"):
            # build the link from the fixed base URL, not from the page URL
            url = baseUrl + href

            if url not in urls:
                urls.append(url)
                print(url)
                # recursive call, commented out on purpose
                # scrape(url, baseUrl)


if __name__ == "__main__":
    # page to scrape
    site = "https://www.metatrader4.com/en/trading-platform/help/beginning/autotrading"
    # no trailing slash, because every matched href already starts with "/"
    baseUrl = "https://www.metatrader4.com"
    scrape(site, baseUrl)

Output

https://www.metatrader4.com/en
https://www.metatrader4.com/en/trading-platform
https://www.metatrader4.com/en/download
https://www.metatrader4.com/en/trading-platform/forex
https://www.metatrader4.com/en/trading-platform/orders
https://www.metatrader4.com/en/trading-platform/technical-analysis
https://www.metatrader4.com/en/trading-platform/alerts-news
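If following links across pages is wanted after all, as the question asks, one possible sketch is shown below. It is not the answer's method: it swaps recursion for an explicit queue, visits each URL once, and stays on the starting host. The names crawl, fetch_hrefs, get_hrefs, max_pages and the fake_site data are all hypothetical, introduced only for this illustration.

```python
from collections import deque
from urllib.parse import urldefrag, urljoin, urlparse


def fetch_hrefs(url):
    """Download a page and return the raw href values of its <a> tags."""
    # Imported here so that crawl() can be tried offline with a fake fetcher.
    import requests
    from bs4 import BeautifulSoup

    response = requests.get(url, timeout=10)
    soup = BeautifulSoup(response.text, "html.parser")
    # href=True skips anchors that have no href attribute
    return [a["href"] for a in soup.find_all("a", href=True)]


def crawl(start, get_hrefs=fetch_hrefs, max_pages=50):
    """Return same-host URLs reachable from start, in the order visited."""
    host = urlparse(start).netloc
    seen = {start}          # every URL ever queued, so nothing is visited twice
    queue = deque([start])
    visited = []
    while queue and len(visited) < max_pages:
        page = queue.popleft()
        visited.append(page)
        for href in get_hrefs(page):
            # Resolve against the current page, then drop any #fragment
            url = urldefrag(urljoin(page, href))[0]
            if urlparse(url).netloc == host and url not in seen:
                seen.add(url)
                queue.append(url)
    return visited


# Offline demo with a made-up three-page site instead of real HTTP requests
fake_site = {
    "https://example.com/a": ["/b", "/a", "https://other.example/x"],
    "https://example.com/b": ["/a", "/c#top"],
    "https://example.com/c": [],
}
pages = crawl("https://example.com/a", get_hrefs=lambda u: fake_site.get(u, []))
print(pages)
```

Calling crawl(site) with the default fetcher sends real requests, so max_pages is worth keeping small while experimenting.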

Solution 1
0 Accepted 2021-02-23 08:36:14
