
Collect only the 1st level of href in a webpage using Python

I need to retrieve only the 1st-level hrefs of a website. For example, http://www.example.com/ is the website that I need to open and read. I opened the page, collected the hrefs, and obtained all the links, such as /company/organization, /company/globallocations, /company/newsroom, /contact, /sitemap, and so on.

Below is the Python code:

import urllib2
from bs4 import BeautifulSoup

req = urllib2.Request(domain)
response = urllib2.urlopen(req)
soup1 = BeautifulSoup(response, 'lxml')
for link in soup1.find_all('a', href=True):
    print link['href']

My desired output is:

/company, /contact, /sitemap for the website www.example.com

Kindly help and suggest a solution.

The "first level" concept is not clearly defined. If you consider an href with a single / to be first level, simply count how many / characters appear in the href text and decide whether to keep or drop it.
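As a minimal sketch of that idea, the snippet below takes the example hrefs from the question, splits each path on /, and keeps only the first segment, so deeper links like /company/organization collapse to /company:

```python
# Example hrefs taken from the question; in practice these would come
# from soup1.find_all('a', href=True).
hrefs = ['/company/organization', '/company/globallocations',
         '/company/newsroom', '/contact', '/sitemap']

first_level = set()
for href in hrefs:
    parts = [p for p in href.split('/') if p]  # drop empty segments
    if parts:
        first_level.add('/' + parts[0])  # keep only the first path segment

print(sorted(first_level))  # ['/company', '/contact', '/sitemap']
```

Using a set deduplicates the result, which matches the desired output of one entry per top-level section.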

From the web page's point of view, all links on the home page should be considered first level. In that case, you may need a level counter that tracks how many levels deep your crawler has gone, and stop at a certain level.
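One way to sketch that level counter is a breadth-first crawl that stores a depth alongside each page and stops following links past `max_depth`. To keep the example self-contained, an in-memory link map stands in for real HTTP fetches; the pages and links are hypothetical:

```python
from collections import deque

# Hypothetical site structure standing in for real fetched pages;
# a real crawler would fetch each page and extract its hrefs instead.
LINKS = {
    '/': ['/company', '/contact', '/sitemap'],
    '/company': ['/company/organization', '/company/newsroom'],
    '/contact': [],
    '/sitemap': [],
}

def crawl(start, max_depth):
    """Visit pages breadth-first, not following links beyond max_depth."""
    seen = {start}
    queue = deque([(start, 0)])  # (page, depth) pairs
    visited = []
    while queue:
        page, depth = queue.popleft()
        visited.append(page)
        if depth >= max_depth:
            continue  # level counter reached the limit: do not go deeper
        for href in LINKS.get(page, []):
            if href not in seen:
                seen.add(href)
                queue.append((href, depth + 1))
    return visited

print(crawl('/', 1))  # ['/', '/company', '/contact', '/sitemap']
```

With `max_depth=1` the crawl visits only the home page and its direct links, which is exactly the "all home-page links are first level" interpretation.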

Hope that helps.
