Hi, I am using the following Python code to get all the URL links from a webpage:
from bs4 import BeautifulSoup
import urllib2

url = 'https://www.practo.com/delhi/dentist'
resp = urllib2.urlopen(url)
# use the charset declared in the response headers, if any
soup = BeautifulSoup(resp, from_encoding=resp.info().getparam('charset'))
for link in soup.find_all('a', href=True):
    print link['href']
But the above code is not able to fetch all the links; as you can see below, it returns only a few:
https://www.practo.com
/health/login
/for-doctors
javascript:void(0);
#
http://help.practo.com/practo-search/practo-relevance-algorithm
http://help.practo.com/practo-search/practo-ranking-algorithm
https://www.practo.com/delhi/clinic/prabhat-dental-care-shahdara?subscription_id=416433&specialization=Dentist&show_all=true
https://www.practo.com/delhi/clinic/prabhat-dental-care-shahdara?subscription_id=416433&specialization=Dentist&show_all=true
https://www.practo.com/delhi/clinic/prabhat-dental-care-shahdara#services
https://www.practo.com
/health/login
/for-doctors
javascript:void(0);
#
http://help.practo.com/practo-search/practo-relevance-algorithm
http://help.practo.com/practo-search/practo-ranking-algorithm
https://www.practo.com/delhi/clinic/prabhat-dental-care-shahdara?subscription_id=416433&specialization=Dentist&show_all=true
https://www.practo.com/delhi/clinic/prabhat-dental-care-shahdara?subscription_id=416433&specialization=Dentist&show_all=true
https://www.practo.com/delhi/clinic/prabhat-dental-care-shahdara#services
Can someone please help me understand why this is happening? Is there any other method through which I can scrape all the links? Thanks in advance.
Try this one:
import urllib2
import re

url = 'https://www.practo.com/delhi/dentist?page=1'
resp = urllib2.urlopen(url)
s = resp.read()

# capture everything between a quoted "http... and the closing double quote,
# so the quote itself is not included in the match
pattern = re.compile(r'"(http[^"]*)"')
urls = pattern.findall(s)
for i in urls:
    print i
This should return all the HTTP links on that website:
from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3; with bs4, use `from bs4 import BeautifulSoup`
import urllib2

url = 'https://www.practo.com/delhi/dentist'
resp = urllib2.urlopen(url)
soup = BeautifulSoup(resp)
for i in soup.findAll('a', href=True):
    link = i['href']
    if link.startswith('http'):
        print link
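For anyone on Python 3, the same filtering can be sketched with bs4. A small inline HTML snippet stands in for the fetched page here (the sample links are taken from the question's output), so the logic runs without a network call:

```python
from bs4 import BeautifulSoup

# Stand-in for the downloaded page, using link shapes from the question.
html = '''
<a href="https://www.practo.com">home</a>
<a href="/health/login">login</a>
<a href="javascript:void(0);">noop</a>
<a href="http://help.practo.com/practo-search/practo-relevance-algorithm">help</a>
'''

soup = BeautifulSoup(html, 'html.parser')
# keep only absolute http/https links, as the answer above does
links = [a['href'] for a in soup.find_all('a', href=True)
         if a['href'].startswith('http')]
print(links)
```

Relative paths like `/health/login` and `javascript:` pseudo-links are filtered out, leaving only the two absolute URLs.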
Had the same problem and was able to fix it by changing the parser used with BeautifulSoup from lxml to html.parser:
#!/usr/bin/python3
from bs4 import BeautifulSoup
import urllib.request
import http.server
import sys

url = 'https://www.practo.com/delhi/dentist'
req = urllib.request.Request(url)
try:
    with urllib.request.urlopen(req) as response:
        html = response.read()
except urllib.error.HTTPError as e:
    errorMsg = http.server.BaseHTTPRequestHandler.responses[e.code][0]
    print("Cannot retrieve URL: {} : {}".format(str(e.code), errorMsg))
    sys.exit(1)
except urllib.error.URLError as e:
    print("Cannot retrieve URL: {}".format(e.reason))
    sys.exit(1)
except Exception:
    print("Cannot retrieve URL: unknown error")
    sys.exit(1)

soup = BeautifulSoup(html, "html.parser")
# href=True skips anchors without an href attribute, avoiding a KeyError
for link in soup.find_all('a', href=True):
    print("Link: {}".format(link['href']))
You can read more about the different parsers in the documentation, under Installing a parser.
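To see why the parser choice matters, you can feed the same invalid HTML to more than one parser and compare what each one recovers; parsers differ in how they repair broken markup, so one may drop tags another keeps. A minimal sketch (lxml is only attempted if it is installed; the example markup is made up):

```python
from bs4 import BeautifulSoup

# Deliberately broken HTML: neither <a> tag is ever closed.
broken = '<div><a href="http://example.com/one">one<a href="http://example.com/two">'

for parser in ('html.parser', 'lxml'):
    try:
        soup = BeautifulSoup(broken, parser)
    except Exception:
        # bs4 raises FeatureNotFound when the requested parser is missing
        print(parser, 'not installed')
        continue
    hrefs = [a['href'] for a in soup.find_all('a', href=True)]
    print(parser, '->', hrefs)
```

With html.parser, both hrefs are recovered; if the output differs between parsers on your real page, that explains missing links.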