Collect only the 1st level of href in a webpage using Python
I need to retrieve only the first level of each href on a website. For example, http://www.example.com/ is the website that I need to open and read. I opened the page, collected the hrefs, and obtained all the links, such as /company/organization, /company/globallocations, /company/newsroom, /contact, /sitemap, and so on.
Below is the Python code.
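import urllib2
from bs4 import BeautifulSoup

# Site to open and read (Python 2 code, as in the original post)
domain = 'http://www.example.com/'
req = urllib2.Request(domain)
response = urllib2.urlopen(req)
soup1 = BeautifulSoup(response, 'lxml')
# Print the href of every <a> tag that has one
for link in soup1.find_all('a', href=True):
    print link['href']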
My desired output is:
/company, /contact, /sitemap for the website www.example.com
Kindly help and suggest a solution.
The concept of "first level" is not clear. If you consider an href containing a single / to be first level, simply count how many / characters appear in the href text and decide whether to keep it or drop it.
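One way to apply the slash-counting idea is sketched below in Python 3, using only the standard library. The helper names (`path_depth`, `first_level`, `first_level_paths`) are hypothetical, and the sketch assumes that "first level" means the first path segment of each href. It truncates deeper links to that segment instead of dropping them, since that appears to match the desired output in the question.

```python
from urllib.parse import urlparse


def path_depth(href):
    """Return the number of non-empty path segments in an href."""
    return len([seg for seg in urlparse(href).path.split("/") if seg])


def first_level(href):
    """Return '/<first segment>' of an href, or None if there is none.

    Assumption: links such as mailto: or javascript: are not pages,
    so anything with a scheme other than http/https is skipped.
    """
    parts = urlparse(href)
    if parts.scheme not in ("", "http", "https"):
        return None
    segments = [seg for seg in parts.path.split("/") if seg]
    return "/" + segments[0] if segments else None


def first_level_paths(hrefs):
    """Collect unique first-level paths, keeping first-seen order."""
    seen = []
    for href in hrefs:
        level = first_level(href)
        if level is not None and level not in seen:
            seen.append(level)
    return seen


# The hrefs listed in the question
hrefs = [
    "/company/organization",
    "/company/globallocations",
    "/company/newsroom",
    "/contact",
    "/sitemap",
]

# Truncate every link to its first segment
print(first_level_paths(hrefs))  # ['/company', '/contact', '/sitemap']

# Alternative reading: keep only links that are exactly one level deep
print([h for h in hrefs if path_depth(h) == 1])  # ['/contact', '/sitemap']
```

In the question's loop, `link['href']` could be passed to `first_level` instead of being printed directly. Note that this sketch does not check the host name, so links to other sites would need to be filtered out separately.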
From the web page's point of view, all links on the home page should be considered first level. In this case, you may need to create a level counter to track how many levels deep your crawler goes, and stop at a certain level.
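The level counter could look roughly like the following Python 3 sketch, which does a breadth-first crawl and records the level at which each URL is first seen. Everything here is an assumption for illustration: the names `LinkParser`, `extract_links`, `fetch_links`, and `crawl` are hypothetical, links are extracted with the standard library's `html.parser` rather than BeautifulSoup so the example has no third-party dependencies, and the demo uses a small in-memory site instead of real network requests.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class LinkParser(HTMLParser):
    """Collects the href attribute of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_links(html, base_url):
    """Return the absolute URLs of all links found in an HTML string."""
    parser = LinkParser()
    parser.feed(html)
    return [urljoin(base_url, href) for href in parser.hrefs]


def fetch_links(url):
    """Download a page and return its links (requires network access)."""
    with urlopen(url) as response:
        html = response.read().decode("utf-8", errors="replace")
    return extract_links(html, url)


def crawl(start_url, max_depth, get_links=fetch_links):
    """Breadth-first crawl that stops at max_depth.

    Returns a dict mapping each URL to the level at which it was first
    seen: the start page is level 0, links on it are level 1, and so on.
    """
    levels = {start_url: 0}
    queue = deque([start_url])
    while queue:
        url = queue.popleft()
        depth = levels[url]
        if depth >= max_depth:
            continue  # do not follow links beyond the chosen level
        for link in get_links(url):
            if link not in levels:
                levels[link] = depth + 1
                queue.append(link)
    return levels


# Demo on a made-up in-memory site, so no network is needed
home = "http://www.example.com/"
site = {
    home: '<a href="/company">Company</a> <a href="/contact">Contact</a>',
    home + "company": '<a href="/company/newsroom">Newsroom</a>',
}


def fake_links(url):
    return extract_links(site.get(url, ""), url)


first_level_only = crawl(home, 1, get_links=fake_links)
print(sorted(u for u, lvl in first_level_only.items() if lvl == 1))
```

With `max_depth=1` the crawler reads only the home page, so every link it reports is first level in this sense. Calling `crawl(home, 1)` without `get_links` would use `fetch_links` and hit the real site.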
Hope that helps.