
Collect only the 1st level of href in a webpage using Python

I need to retrieve only the 1st-level hrefs of a website. For example, http://www.example.com/ is the website that I need to open and read. I opened the page, collected the hrefs, and obtained all the links, such as /company/organization, /company/globallocations, /company/newsroom, /contact, /sitemap, and so on.

Below is the Python code:

import urllib2
from bs4 import BeautifulSoup

req = urllib2.Request(domain)
response = urllib2.urlopen(req)
soup1 = BeautifulSoup(response, 'lxml')
for link in soup1.find_all('a', href=True):
    print link['href']

My desired output is:

/company, /contact, /sitemap for the website www.example.com

Kindly help and suggest a solution.

The "first level" concept is not clearly defined. If you consider an href with a single / to be first level, simply count how many / characters appear in the href text and decide whether to keep or drop it.
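As a minimal sketch of that idea, the snippet below takes the example hrefs from the question, splits each path on /, and keeps only the first segment, so deeper links like /company/organization collapse to /company:

```python
# Example hrefs taken from the question; in practice these would come
# from soup1.find_all('a', href=True).
hrefs = ['/company/organization', '/company/globallocations',
         '/company/newsroom', '/contact', '/sitemap']

first_level = set()
for href in hrefs:
    parts = [p for p in href.split('/') if p]  # drop empty segments
    if parts:
        first_level.add('/' + parts[0])  # keep only the first path segment

print(sorted(first_level))  # ['/company', '/contact', '/sitemap']
```

Using a set deduplicates the result, which matches the desired output of one entry per top-level section.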

From the web page's point of view, all links on the home page should be considered first level. In that case, you may need a level counter that tracks how many levels deep your crawler has gone, and stop at a certain level.
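One way to sketch that level counter is a breadth-first crawl that stores a depth alongside each page and stops following links past `max_depth`. To keep the example self-contained, an in-memory link map stands in for real HTTP fetches; the pages and links are hypothetical:

```python
from collections import deque

# Hypothetical site structure standing in for real fetched pages;
# a real crawler would fetch each page and extract its hrefs instead.
LINKS = {
    '/': ['/company', '/contact', '/sitemap'],
    '/company': ['/company/organization', '/company/newsroom'],
    '/contact': [],
    '/sitemap': [],
}

def crawl(start, max_depth):
    """Visit pages breadth-first, not following links beyond max_depth."""
    seen = {start}
    queue = deque([(start, 0)])  # (page, depth) pairs
    visited = []
    while queue:
        page, depth = queue.popleft()
        visited.append(page)
        if depth >= max_depth:
            continue  # level counter reached the limit: do not go deeper
        for href in LINKS.get(page, []):
            if href not in seen:
                seen.add(href)
                queue.append((href, depth + 1))
    return visited

print(crawl('/', 1))  # ['/', '/company', '/contact', '/sitemap']
```

With `max_depth=1` the crawl visits only the home page and its direct links, which is exactly the "all home-page links are first level" interpretation.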

Hope that helps.
