
Find all links recursively using bs4 and Python problem

I'm using the code below to recursively collect all links from a given website. The only problem is that I get this at the beginning of the output file:

https://www.metatrader4.com/en/trading-platform/help/beginning/autotrading/en

https://www.metatrader4.com/en/trading-platform/help/beginning/autotrading/en/en

https://www.metatrader4.com/en/trading-platform/help/beginning/autotrading/en/en/en

...

https://www.metatrader4.com/en/trading-platform/help/beginning/autotrading/en/en/en/en/en/en/en/en/en/en/en/en

etc.

How can I prevent/remove this?

The code:

from bs4 import BeautifulSoup
import requests

# lists 
urls=[] 
   
# function created 
def scrape(site): 
       
    # getting the request from url 
    r = requests.get(site) 
       
    # converting the text 
    s = BeautifulSoup(r.text,"html.parser") 
       
    for i in s.find_all("a"):
          
        href = i.attrs['href'] 
           
        if href.startswith("/"): 
            site = site+href
               
            if site not in  urls: 
                urls.append(site)  
                print(site) 
                # calling it self 
                scrape(site) 
   
# main function 
if __name__ =="__main__": 
   
    # website to be scrape 
    site="https://www.metatrader4.com/en/trading-platform/help/beginning/autotrading"
   
    # calling function 
    scrape(site)

What happens?

You are replacing site with itself plus the href, and the first href it finds contains only /en. Each recursive call therefore receives a URL with one more /en appended.
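The growth can be reproduced without any network access. This is a minimal sketch that assumes, as the output in the question suggests, that every fetched page again yields /en as its first root-relative link; the list name seen is made up for illustration.

```python
# Reproduces the URL growth offline: no request is made.
site = "https://www.metatrader4.com/en/trading-platform/help/beginning/autotrading"
href = "/en"  # assumed to be the first href found on each page

seen = []
for _ in range(3):
    site = site + href  # same reassignment as in scrape()
    seen.append(site)
    print(site)
```

Each pass appends another /en because the href is glued onto the full page URL rather than onto the site root.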

How to fix?

Instead of site, use a baseUrl:

baseUrl = "https://www.metatrader4.com/"
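As a possible alternative to hard-coding baseUrl, the standard library's urllib.parse.urljoin can resolve an href against the URL of the page it was found on. This is only a sketch of that swapped-in approach, not part of the original answer; the example.com link is a made-up value.

```python
from urllib.parse import urljoin

page = "https://www.metatrader4.com/en/trading-platform/help/beginning/autotrading"

# A root-relative href replaces the whole path of the page URL.
absolute = urljoin(page, "/en/download")
print(absolute)  # https://www.metatrader4.com/en/download

# An already-absolute href is returned unchanged.
external = urljoin(page, "https://example.com/docs")
print(external)  # https://example.com/docs
```

Because urljoin handles the slashes itself, it does not matter whether the base ends with "/" or not.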

I'm not sure why you are calling scrape() over and over again. I commented that call out in my example because it is not necessary.

Example

from bs4 import BeautifulSoup
import requests

# lists 
urls=[] 
   
# function created 
def scrape(site, baseUrl): 
       
    # getting the request from url 
    r = requests.get(site) 
       
    # converting the text 
    s = BeautifulSoup(r.text,"html.parser") 
       
    for i in s.find_all("a"):
          
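        # .get() avoids a KeyError for <a> tags that have no href
        href = i.get("href", "")
           
        if href.startswith("/"): 
            # baseUrl ends with "/", so strip it to avoid a double slash
            site = baseUrl.rstrip("/") + href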
               
            if site not in  urls: 
                urls.append(site)  
                print(site) 
                # calling it self 
                # scrape(site, baseUrl) 
   
# main function 
if __name__ =="__main__": 
   
    # website to be scrape 
    site="https://www.metatrader4.com/en/trading-platform/help/beginning/autotrading"
    baseUrl = "https://www.metatrader4.com/"
    # calling function 
    scrape(site, baseUrl)

Output

https://www.metatrader4.com/en
https://www.metatrader4.com/en/trading-platform
https://www.metatrader4.com/en/download
https://www.metatrader4.com/en/trading-platform/forex
https://www.metatrader4.com/en/trading-platform/orders
https://www.metatrader4.com/en/trading-platform/technical-analysis
https://www.metatrader4.com/en/trading-platform/alerts-news
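If following the links on every discovered page is in fact the goal, as the question's title suggests, one way to do it without the runaway URLs is to track visited pages and resolve each href against the page it came from. The sketch below is an assumption-laden illustration, not the answer's method: crawl, get_links, max_pages and fake_site are hypothetical names, and get_links stands in for the requests + BeautifulSoup code so the example runs offline.

```python
from urllib.parse import urljoin, urlparse


def crawl(start, get_links, max_pages=50):
    """Collect same-host URLs reachable from start, visiting each page once.

    get_links(url) is a hypothetical callable returning the raw href strings
    found on that page (for example via requests and BeautifulSoup).
    """
    host = urlparse(start).netloc
    visited = set()
    order = []          # URLs in the order they were visited
    pending = [start]   # explicit stack instead of recursive calls

    while pending and len(visited) < max_pages:
        url = pending.pop()
        if url in visited:
            continue
        visited.add(url)
        order.append(url)
        for href in get_links(url):
            absolute = urljoin(url, href)  # resolve against the current page
            # stay on the starting host and skip pages already handled
            if urlparse(absolute).netloc == host and absolute not in visited:
                pending.append(absolute)
    return order


# Usage with an in-memory stand-in for a site (made-up pages, no network).
fake_site = {
    "https://example.com/": ["/a", "/b"],
    "https://example.com/a": ["/", "/b"],
    "https://example.com/b": ["/a", "https://other.example/x"],
}
found = crawl("https://example.com/", lambda url: fake_site.get(url, []))
print(found)
```

The explicit stack avoids hitting Python's recursion limit on large sites, and max_pages gives the crawl a hard stop.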
