I have started a private project: web-scraping with Python and BeautifulSoup in Visual Studio Code (1.41.0).
I was able to scrape another site with the same structure as my "problem site". However, I have now found that BeautifulSoup doesn't find all the div tags (there should be 20 per page, but I find only 3 of them). I searched Stack Overflow but did not find a solution (or did not understand it).
Website: https://www.comparis.ch/gesundheit/arzt/pathologie
The html structure I'm interested in looks like this:
I get all the <div class="css-15dj4ut"></div> from the <div class="css-fh99y9 excbu0j0">...</div> but none from the <div class="css-roynbj excbu0j0"></div>. Do you have any idea why?
I iterate over every URL to get to each page:
import urllib.request
from bs4 import BeautifulSoup

# urls, basicUrl, urlAddon and endIndex are defined elsewhere in the script
for i in range(0, endIndex):
    try:
        if i == 0:
            # the first page has no page suffix
            urls.append(basicUrl)
        else:
            urls.append(basicUrl + urlAddon + str(i + 1))
        page = urllib.request.urlopen(urls[i])
        soup = BeautifulSoup(page, 'html.parser')
        getSurgeonName(soup)
    except Exception as err:
        print("A URL request error occurred:", err)
Function Version 1:
import re

def getSurgeonName(soup):
    # finds only the first 3 surgeons on the page
    docName = re.compile('css-15dj4ut')
    docNameTags = soup.find_all('div', attrs={'class': docName})
    for a in docNameTags:
        docNameList.append(a.getText())
Function Version 2:
def getSurgeonName(soup):
    # first parent class: these names are found
    parentClass = re.compile('css-fh99y9 excbu0j0')
    parentItems = soup.find_all('div', attrs={'class': parentClass})
    for parent in parentItems:
        children = parent.findChildren('div', {'class': 'css-15dj4ut'})
        if children:  # avoid an IndexError when a parent has no name div
            docNameList.append(children[0].getText())
    # second parent class: nothing is found here
    parentClass = re.compile('css-roynbj excbu0j0')
    parentItems = soup.find_all('div', attrs={'class': parentClass})
    for parent in parentItems:
        children = parent.findChildren('div', {'class': 'css-15dj4ut'})
        if children:
            docNameList.append(children[0].getText())
The data you want is loaded dynamically via JavaScript when the page loads, so the requests package cannot see it, because it does not render JavaScript. However, I was able to locate the script tag that holds the data as a JSON string and load it into a Python dict with json. From there you can parse whatever you want :).
import requests
from bs4 import BeautifulSoup
import json

r = requests.get("https://www.comparis.ch/gesundheit/arzt/pathologie")
soup = BeautifulSoup(r.content, 'html.parser')
# .string rather than .text: some BeautifulSoup versions return an empty .text for script tags
script = soup.find("script", {'id': '__NEXT_DATA__'}).string
data = json.loads(script)
print(data.keys())  # top-level keys of the JSON dict
dumper = json.dumps(data, indent=4)
print(dumper)  # to see it in human-readable format
Something like:
for item in data['props']['pageProps']['doctorResults']['doctorModels']:
    print(item['name'])
Output:
Mohamed Abdou
Dr. med. Heiner Adams
Dr. med. Franziska Aebersold
Prof. Dr. med. Adriano Aguzzi
Dr. med. Maria Ammann
Prosper Anani
Dr. med. Max Arnaboldi
Dr. med. Walter Arnold
Dr. med. Irena Baltisser
Dr. med. Fridolin Bannwart
Dr. med. Yara Banz
Dr. med. André Barghorn
Dr. Jessica Barizzi
Prof. Dr. med. Daniel Baumhoer
Audrey Baur Chaubert
Dr. med. Christian Georg Bayerl
Dr. med. Marc Beer
Dr. med. Sabina Berezowska
Dr. med. Steffen Bergelt
Dr. med. Barbara Elisabeth Berger-Denzler
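If you want to reuse this for every page in your loop, the steps above could be wrapped in a small helper along these lines. This is only a sketch: the function name extract_doctor_names and the sample HTML are made up for illustration, and the key path is the one shown above, which may change whenever the site is updated.

```python
import json
from bs4 import BeautifulSoup


def extract_doctor_names(html):
    """Return the doctor names found in the page's __NEXT_DATA__ JSON (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    script = soup.find("script", {"id": "__NEXT_DATA__"})
    if script is None or not script.string:
        return []  # no embedded JSON on this page
    data = json.loads(script.string)
    # Key path as observed above; assumed to be the same on every result page.
    models = (
        data.get("props", {})
        .get("pageProps", {})
        .get("doctorResults", {})
        .get("doctorModels", [])
    )
    return [m["name"] for m in models if "name" in m]


# Made-up sample page that mimics the structure, so the sketch runs offline.
sample_html = """
<html><body>
<script id="__NEXT_DATA__" type="application/json">
{"props": {"pageProps": {"doctorResults": {"doctorModels": [
  {"name": "Dr. med. Example One"}, {"name": "Example Two"}]}}}}
</script>
</body></html>
"""

print(extract_doctor_names(sample_html))
print(extract_doctor_names("<html><body></body></html>"))
```

In your loop, the call to getSurgeonName(soup) could then probably be replaced by something like docNameList.extend(extract_doctor_names(page.read())), assuming the paginated pages embed their results in the same way.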