Find all links recursively using bs4 and Python
I'm using the code below to recursively collect all links from a given website. The only problem is that I get this at the beginning of the output file:
https://www.metatrader4.com/en/trading-platform/help/beginning/autotrading/en
https://www.metatrader4.com/en/trading-platform/help/beginning/autotrading/en/en
https://www.metatrader4.com/en/trading-platform/help/beginning/autotrading/en/en/en
...
https://www.metatrader4.com/en/trading-platform/help/beginning/autotrading/en/en/en/en/en/en/en/en/en/en/en/en
etc.
How can I prevent/remove this?
The code:
from bs4 import BeautifulSoup
import requests

# collected links
urls = []

def scrape(site):
    # request the page
    r = requests.get(site)
    # parse the HTML
    s = BeautifulSoup(r.text, "html.parser")
    for i in s.find_all("a"):
        href = i.attrs['href']
        if href.startswith("/"):
            site = site + href
            if site not in urls:
                urls.append(site)
                print(site)
                # calling itself
                scrape(site)

if __name__ == "__main__":
    # website to be scraped
    site = "https://www.metatrader4.com/en/trading-platform/help/beginning/autotrading"
    # calling the function
    scrape(site)
You are replacing site with itself plus the href, and the first href it finds contains only /en.
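The growth pattern can also be avoided entirely with the standard library's urllib.parse.urljoin, which resolves an href against the page URL instead of concatenating. This is a sketch of that alternative, not part of the original answer:

```python
# site + href concatenates blindly, so "/en" keeps being appended:
#   .../autotrading      + /en -> .../autotrading/en
#   .../autotrading/en   + /en -> .../autotrading/en/en
# urljoin resolves an absolute-path href against the scheme and host,
# so repeated resolution does not grow the URL.
from urllib.parse import urljoin

page = "https://www.metatrader4.com/en/trading-platform/help/beginning/autotrading"

once = urljoin(page, "/en")
twice = urljoin(once, "/en")
print(once)   # absolute-path href replaces the whole path
print(twice)  # resolving again gives the same URL, no /en/en
```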
Instead of site, use a baseUrl:
baseUrl = "https://www.metatrader4.com/"
I'm not sure why you are calling scrape() over and over again; I commented it out in my example, since it is not necessary.
Example
from bs4 import BeautifulSoup
import requests

# collected links
urls = []

def scrape(site, baseUrl):
    # request the page
    r = requests.get(site)
    # parse the HTML
    s = BeautifulSoup(r.text, "html.parser")
    for i in s.find_all("a"):
        href = i.attrs['href']
        if href.startswith("/"):
            site = baseUrl + href
            if site not in urls:
                urls.append(site)
                print(site)
                # recursion is not necessary here
                # scrape(site, baseUrl)

if __name__ == "__main__":
    site = "https://www.metatrader4.com/en/trading-platform/help/beginning/autotrading"
    baseUrl = "https://www.metatrader4.com/"
    # calling the function
    scrape(site, baseUrl)
Output
https://www.metatrader4.com/en
https://www.metatrader4.com/en/trading-platform
https://www.metatrader4.com/en/download
https://www.metatrader4.com/en/trading-platform/forex
https://www.metatrader4.com/en/trading-platform/orders
https://www.metatrader4.com/en/trading-platform/technical-analysis
https://www.metatrader4.com/en/trading-platform/alerts-news
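If you do want recursive crawling, the usual guard is a visited set checked before each call, so no URL is fetched twice and the recursion terminates. The sketch below uses only the standard library (html.parser instead of bs4) plus a pluggable fetch function, so the traversal logic runs offline against a tiny in-memory "site"; for real use you would pass fetch=lambda u: requests.get(u).text. The names crawl, LinkParser, and the example.com pages are illustrative, not from the original answer:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkParser(HTMLParser):
    """Collects href attributes of <a> tags (stdlib stand-in for bs4)."""
    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)

def crawl(url, fetch, visited=None):
    """Recursively collect reachable URLs; the visited set stops cycles."""
    if visited is None:
        visited = set()
    if url in visited:
        return visited
    visited.add(url)
    parser = LinkParser()
    parser.feed(fetch(url))  # fetch(url) must return the page's HTML
    for href in parser.hrefs:
        # urljoin resolves each href against the current page's URL
        crawl(urljoin(url, href), fetch, visited)
    return visited

# Tiny in-memory site standing in for the network; note /a links back
# to / and the crawl still terminates.
pages = {
    "https://example.com/":  '<a href="/a">A</a><a href="/b">B</a>',
    "https://example.com/a": '<a href="/">home</a>',
    "https://example.com/b": '',
}
found = crawl("https://example.com/", pages.get)
print(sorted(found))
```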