For my master's thesis I want to send a questionnaire to as many people as possible in the field (early childhood education), so my goal is to scrape email addresses of daycare centers (KiTa) from a public site. I am very new to Python, so while this may seem trivial to most, it has proven to be quite a challenge at my level of knowledge. I'm also not familiar with the lingo, so I don't even know what I need to look for.
This is the site (German): https://www.kitanetz.de/
To get to the content I want, I first have to select a federal state ("Bundesland"), which takes me to the next level, where I need to click "Kreise auflisten". That leads to a page listing all the districts ("Kreise") inside the state. Every link opens a further level of pages with postal codes and profile links. Some of those profiles have email addresses, some don't (I found tutorials on how to handle that, so it shouldn't be a problem).
It has taken me two days to scrape the postal codes and names of the centres from one of those pages. What do I need to do so that Python can iterate through every state, every district and every profile to get to the links? If you know a resource or a keyword I should look out for, that would be a great next step. I also haven't yet tried to put the data from this code into a dataframe using pandas, and my other attempts didn't work.
This is my attempt so far. Comments starting with ## are my own comments/questions; comments starting with # come from the tutorial:
import requests
from bs4 import BeautifulSoup
## Here's the tutorial I was following: https://www.pluralsight.com/guides/extracting-data-html-beautifulsoup
# Step 1: Sending a HTTP request to a URL
url = requests.get("https://www.kitanetz.de/bezirke/bezirke.php?land=Baden-W%C3%BCrttemberg&kreis=Alb-Donau-Kreis")
# Step 2: Parse the html content
soup = BeautifulSoup(url.text, 'lxml')
# print(soup.prettify()) # print the parsed data of html
# Step 3: Analyze the HTML tag, where your content lives
# Create a data dictionary to store the data.
data = {}
## it says in the tutorial, but what does that actually do?
## Get the table inside the <div id="inhalt">
table = soup.find_all('table')[0]
## Get the data you want: PLZ, Name Kita (ids) and href to profiles
plz = table.find_all('td', attrs={"headers": "header2"})
ids = table.find_all('td', attrs={"headers": "header3"})
table_data = table.find_all("tr") ## contains 101 rows. row [0] is header, using th tags. Rows [1]:[101] use td tags
for link in table.find_all("a"):
    print("Name: {}".format(link.text))
    print("href: {}".format(link.get("href")))
# Get the headers of the list
t_headers = []
for th in table.find_all("th"):
    # remove any newlines and extra spaces from left and right
    t_headers.append(th.text.replace('\n', ' ').strip())
# Get all the rows of table
table_data = []
for tr in table.find_all('tr'): # find all tr's from table ## no, it doesn't
    t_row = {}
    # Each table row is stored in the form of
    ## t_row = {'.': '', 'PLZ': '', 'Name Kita': '', 'Alter': '', 'Profil': ''}
    ## we want: t_row = {'PLZ': '', 'Name Kita': '', 'EMail': ''}. Emails are stored in the hrefs -> next layer
    ## how do I get my plz, ids and hrefs in one dataframe? I'd know in R but this here works different.
    # find all td's(3) in tr and zip it with t_header
    for td, th in zip(tr.find_all("td"), t_headers):
        t_row[th] = td.text.replace('\n', '').strip()
    table_data.append(t_row)
You can use the site's sitemap.xml to get all links to the profiles. Once you have all the links, it's just simple parsing:
import re
import requests
from bs4 import BeautifulSoup

url = 'https://www.kitanetz.de/sitemap.xml'
sitemap = BeautifulSoup(requests.get(url).content, 'html.parser')

# keep only URLs of the form .../<digits>/<name>.php (the profile pages)
r = re.compile(r'/\d+/[^/]+\.php')

for loc in sitemap.select('loc'):
    if r.search(loc.text):
        html_data = requests.get(loc.text).text
        soup = BeautifulSoup(html_data, 'html.parser')
        title = soup.h1.text
        # the address is stored in three pieces: ez (local part), ey (domain), ex (TLD)
        email = re.search(r"ez='(.*?)'.*?ey='(.*?)'.*?ex='(.*?)'", html_data, flags=re.S)
        if email:
            email = email[1] + '@' + email[2] + '.' + email[3]
        else:
            email = '-'
        print('{:<60} {:<35} {}'.format(title, email, loc.text))
Prints:
Evangelisch-lutherische Kindertagessstätte Lemförde kts.lemfoerde@evlka.de https://www.kitanetz.de/niedersachsen/49448/stettiner-str-43b.php
Kindertagesstätte Stuhr I kiga.stuhr@stuhr.de https://www.kitanetz.de/niedersachsen/28816/stuhrer-landstrasse33a.php
Kita St. Bonifatius (Frankestraße) frankestr@kath-kita-wunstorf.de https://www.kitanetz.de/niedersachsen/31515/frankestrasse11.php
Ev. Kita Ketzin ektketzin.wagenschuetz@arcor.de https://www.kitanetz.de/brandenburg/14669/rathausstr17.php
Humanistische Kindertagesstätte `Die kleinen Strolche´ strolche@humanisten.de https://www.kitanetz.de/niedersachsen/30823/auf_der_horst115.php
Kindertagesstätte Idensen kita.idensen@wunstorf.de https://www.kitanetz.de/niedersachsen/31515/an_der_sigwardskirche2.php
Kindergroßtagespflege `Nesthäkchen´ nesthaekchen-isernhagen@gmx.de https://www.kitanetz.de/niedersachsen/30916/am_rathfeld4.php
Venhof Kindertagesstätte venhof@t-online.de https://www.kitanetz.de/niedersachsen/31515/schulstrasse14.php
Kindergarten Uetze `Buddelkiste´ buddelkiste@uetze.de https://www.kitanetz.de/niedersachsen/31311/eichendorffstrasse2b.php
Kita Lindenblüte m.herzog@lebenshilfe-dh.de https://www.kitanetz.de/niedersachsen/27232/lindern17.php
DRK Kita Luthe kita.luthe@drk-hannover.de https://www.kitanetz.de/niedersachsen/31515/an_der_boehmerke7.php
Freier Kindergarten Allerleirauh info@kindergarten-allerleirauh.de https://www.kitanetz.de/niedersachsen/31303/dachtmisser_weg3.php
Ev.-luth. Kindergarten St. Johannis johannis.bs.kita@lk-bs.de https://www.kitanetz.de/niedersachsen/38102/leonhardstr40.php
Kindertagesstätte Immensen-Arpke I kita.immensen@htp-tel.de https://www.kitanetz.de/niedersachsen/31275/am_schnittgraben15.php
SV Mörsen-Scharrendorf Mini-Club svms-mini-club@freenet.de https://www.kitanetz.de/niedersachsen/27239/am-sportheim6.php
Kindergarten Transvaal kiga-transvaal@awo-emden.de https://www.kitanetz.de/niedersachsen/26723/althusiusstr89.php
Städtische Kindertagesstätte Gartenstadt kita.gartenstadt@braunschweig.de https://www.kitanetz.de/niedersachsen/38122/wurmbergstr48.php
Kindergruppe Till Eulenspiegel e.V. - Bärenbande & Windelrocker tilleulenspiegel-bs@gmx.de https://www.kitanetz.de/niedersachsen/38102/kurt-schumacher-str7.php
Ev. luth. Kindertagesstätte der Versöhnun kts.versoehnung-garbsen@evlka.de https://www.kitanetz.de/niedersachsen/30823/im_alten_dorfe6.php
Kinderkrippe Ratzenspatz ratzenspatz@kila-ini.de https://www.kitanetz.de/niedersachsen/31535/am_goetheplatz5.php
Kinderkrippe Hemmingen-Westerfeld krippe-hw@stadthemmingen.de https://www.kitanetz.de/niedersachsen/30966/berliner_strasse16-22.php
... and so on.
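If you want to keep the results instead of printing them, one option is to collect each profile as a dict and write the list out at the end. The following is only a minimal sketch: it reuses the two regular expressions from the snippet above, but the helper names (`is_profile_url`, `extract_email`, `rows_to_csv`), the column names and the sample HTML fragment are made up for illustration, and it works on a hard-coded string instead of fetching pages so that it runs without network access.

```python
import csv
import io
import re

# Same patterns as in the snippet above.
PROFILE_RE = re.compile(r'/\d+/[^/]+\.php')
EMAIL_RE = re.compile(r"ez='(.*?)'.*?ey='(.*?)'.*?ex='(.*?)'", flags=re.S)


def is_profile_url(url):
    """Return True if the URL contains /<digits>/<name>.php."""
    return PROFILE_RE.search(url) is not None


def extract_email(html_data):
    """Rebuild the address from the ez/ey/ex pieces, or return None."""
    m = EMAIL_RE.search(html_data)
    if m is None:
        return None
    return '{}@{}.{}'.format(m[1], m[2], m[3])


def rows_to_csv(rows, fieldnames=('title', 'email', 'url')):
    """Serialise a list of dicts to CSV text (hypothetical column names)."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(fieldnames))
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


# Hypothetical page fragment, invented for this example only.
sample_html = "<script>var ez='kita.example'; var ey='example'; var ex='org';</script>"
sample_url = 'https://www.kitanetz.de/niedersachsen/31515/schulstrasse14.php'

rows = []
if is_profile_url(sample_url):
    rows.append({
        'title': 'Example Kita',
        'email': extract_email(sample_html) or '-',
        'url': sample_url,
    })

csv_text = rows_to_csv(rows)
print(csv_text)
```

In the real loop you would append one dict per profile page and call `rows_to_csv` once at the end. Since you mentioned pandas: the same list of dicts can also be passed directly to `pandas.DataFrame(rows)`, if pandas is installed.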