Could not fetch all href links using the BeautifulSoup module with Python 2.7
Hi, I am using the following Python code to get all the URL links from a webpage:
from bs4 import BeautifulSoup
import urllib2
url='https://www.practo.com/delhi/dentist'
resp = urllib2.urlopen(url)
soup = BeautifulSoup(resp, from_encoding=resp.info().getparam('charset'))
for link in soup.find_all('a', href=True):
    print link['href']
But the above code is not able to fetch all the links; as you can see below, only a few links are returned:
https://www.practo.com
/health/login
/for-doctors
javascript:void(0);
#
http://help.practo.com/practo-search/practo-relevance-algorithm
http://help.practo.com/practo-search/practo-ranking-algorithm
https://www.practo.com/delhi/clinic/prabhat-dental-care-shahdara?subscription_id=416433&specialization=Dentist&show_all=true
https://www.practo.com/delhi/clinic/prabhat-dental-care-shahdara?subscription_id=416433&specialization=Dentist&show_all=true
https://www.practo.com/delhi/clinic/prabhat-dental-care-shahdara#services
Can someone please help me understand why this is happening? Is there any other method through which I can scrape all the links? Thanks in advance.
Try this one:
import urllib2
import re
url='https://www.practo.com/delhi/dentist?page=1'
resp = urllib2.urlopen(url)
s = resp.read()
regexp = r'"(http[^"]*)"'   # capture group so the closing quote is not part of the match
pattern = re.compile(regexp)
urls = re.findall(pattern, s)
for i in urls:
    print i
This shall return all the http links on that website:
from BeautifulSoup import BeautifulSoup
import urllib2
url='https://www.practo.com/delhi/dentist'
resp = urllib2.urlopen(url)
soup = BeautifulSoup(resp)
for i in soup.findAll('a', href=True):
    link = i['href']
    if link[:4] == 'http':
        print link
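A side note on the answer above: the relative paths in the question's output (/health/login, /for-doctors) are valid links too; filtering on the http prefix simply discards them. They can instead be resolved against the base URL. A minimal Python 3 sketch (stdlib only, hypothetical href list for illustration) using urllib.parse.urljoin:

```python
from urllib.parse import urljoin

base = 'https://www.practo.com/delhi/dentist'
# Sample hrefs of the kinds seen in the question's output.
hrefs = ['/health/login', '/for-doctors', 'https://www.practo.com',
         'javascript:void(0);', '#']

for href in hrefs:
    # Skip in-page anchors and JavaScript pseudo-links.
    if href.startswith('#') or href.startswith('javascript:'):
        continue
    # urljoin leaves absolute URLs alone and resolves relative ones.
    print(urljoin(base, href))
```

This prints https://www.practo.com/health/login, https://www.practo.com/for-doctors, and https://www.practo.com, so no link is lost just because it was written relative to the site root.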
I had the same problem and was able to fix it by changing the parser used with BeautifulSoup from lxml to html.parser:
#!/usr/bin/python3
from bs4 import BeautifulSoup
import urllib.request
import http.server

url = 'https://www.practo.com/delhi/dentist'
req = urllib.request.Request(url)
try:
    with urllib.request.urlopen(req) as response:
        html = response.read()
except urllib.error.HTTPError as e:
    errorMsg = http.server.BaseHTTPRequestHandler.responses[e.code][0]
    print("Cannot retrieve URL: {} : {}".format(str(e.code), errorMsg))
    raise SystemExit
except urllib.error.URLError as e:
    print("Cannot retrieve URL: {}".format(e.reason))
    raise SystemExit

soup = BeautifulSoup(html, "html.parser")
for link in soup.find_all('a', href=True):
    print("Link: {}".format(link['href']))
You can read more about the different parsers in the documentation under Installing a parser.
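For comparison, link extraction does not strictly require BeautifulSoup at all: the same html.parser machinery is exposed directly by the standard library's HTMLParser class. A minimal stdlib-only sketch (Python 3, fed a small hypothetical HTML snippet):

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag encountered."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the start tag.
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value:
                    self.links.append(value)

collector = LinkCollector()
collector.feed("<a href='/one'>one</a><p><a href='/two'>two</a>")
print(collector.links)  # → ['/one', '/two']
```

This is more work than soup.find_all('a', href=True), but it avoids a third-party dependency and behaves identically to BeautifulSoup's html.parser backend, since both are built on the same tokenizer.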