
Web scraping with Python/BeautifulSoup: Site with multiple links to profiles > needing profile contents

For my master's thesis I want to send a questionnaire to as many people as possible in the field (early childhood education), so my goal is to scrape the email addresses of daycare centers (KiTa) from a public site. I am very new to Python, so while this may seem trivial to most, it has proven quite a challenge at my level of knowledge. I'm also not familiar with the terminology, so I don't even know what to search for.

This is the site (German): https://www.kitanetz.de/

To get to the content I want, I first have to select a federal state ("Bundesland"), which takes me to the next level, where I need to click "Kreise auflisten" (list districts). That leads to a page listing all the districts in that state. Each district link opens a further page with postal codes and profile links. Some of those profiles have email addresses, some don't (I found tutorials on handling that, so it shouldn't be a problem).

It has taken me two days to scrape the postal codes and names of the centers from one of those pages. What do I need to do so that Python iterates through every state, every district and every profile to get to the links? If you know a resource or a keyword I should look for, that would be a great next step. I also haven't tried to put the data from this code into a dataframe using pandas yet, but my other attempts didn't work.

This is my attempt so far. I added ## to my comments/questions in the code. # are comments from the tutorial:

import requests
from bs4 import BeautifulSoup

## Here's the tutorial I was following: https://www.pluralsight.com/guides/extracting-data-html-beautifulsoup

# Step 1: Sending a HTTP request to a URL
url = requests.get("https://www.kitanetz.de/bezirke/bezirke.php?land=Baden-W%C3%BCrttemberg&kreis=Alb-Donau-Kreis")

# Step 2: Parse the html content
soup = BeautifulSoup(url.text, 'lxml')
# print(soup.prettify()) # print the parsed data of html

# Step 3: Analyze the HTML tag, where your content lives
# Create a data dictionary to store the data.
data = {}
## it says in the tutorial, but what does that actually do? 

## Get the table inside the <div id="inhalt">
table = soup.find_all('table')[0]

## Get the data you want: PLZ, Name Kita (ids) and href to profiles
plz = table.find_all('td', attrs={"headers": "header2"})
ids = table.find_all('td', attrs={"headers": "header3"})

table_data = table.find_all("tr")  ## contains 101 rows. row [0] is header, using th tags. Rows [1]:[101] use td tags

for link in table.find_all("a"):
    print("Name: {}".format(link.text))
    print("href: {}".format(link.get("href")))
    
# Get the headers of the list
t_headers = []
for th in table.find_all("th"):
    # remove any newlines and extra spaces from left and right
    t_headers.append(th.text.replace('\n', ' ').strip())
    
# Get all the rows of table
table_data = []
for tr in table.find_all('tr'): # find all tr's from table ## no, it doesn't
    t_row = {}
    # Each table row is stored in the form of
    ## t_row = {'.': '', 'PLZ': '', 'Name Kita': '', 'Alter': '', 'Profil':''}
    ## we want: t_row = {'PLZ':'', 'Name Kita': '', 'EMail': ''}. Emails are stored in the hrefs -> next layer
    ## how do I get my plz, ids and hrefs in one dataframe? I'd know in R but this here works different.

    # find all td's(3) in tr and zip it with t_header
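    for td, th in zip(tr.find_all("td"), t_headers):
        t_row[th] = td.text.replace('\n', '').strip()
    # append each row once, after all of its cells are filled;
    # the header row has no td cells, so its t_row stays empty and is skipped
    if t_row:
        table_data.append(t_row)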

 

You can use the site's sitemap.xml to get all links to profiles. When you have all links, then it's just simple parsing:

import re
import requests
from bs4 import BeautifulSoup

url = 'https://www.kitanetz.de/sitemap.xml'

sitemap = BeautifulSoup(requests.get(url).content, 'html.parser')
r = re.compile(r'/\d+/[^/]+\.php')
for loc in sitemap.select('loc'):
    if r.search(loc.text):
        html_data = requests.get(loc.text).text
        soup = BeautifulSoup(html_data, 'html.parser')

        title = soup.h1.text

        email = re.search(r"ez='(.*?)'.*?ey='(.*?)'.*?ex='(.*?)'", html_data, flags=re.S)
        if email:
            email = email[1] + '@' + email[2] + '.' + email[3]
        else:
            email = '-'

        print('{:<60} {:<35} {}'.format(title, email, loc.text))

Prints:

Evangelisch-lutherische Kindertagessstätte Lemförde          kts.lemfoerde@evlka.de              https://www.kitanetz.de/niedersachsen/49448/stettiner-str-43b.php
Kindertagesstätte Stuhr I                                    kiga.stuhr@stuhr.de                 https://www.kitanetz.de/niedersachsen/28816/stuhrer-landstrasse33a.php
Kita St. Bonifatius (Frankestraße)                           frankestr@kath-kita-wunstorf.de     https://www.kitanetz.de/niedersachsen/31515/frankestrasse11.php
Ev. Kita Ketzin                                              ektketzin.wagenschuetz@arcor.de     https://www.kitanetz.de/brandenburg/14669/rathausstr17.php
Humanistische Kindertagesstätte `Die kleinen Strolche´       strolche@humanisten.de              https://www.kitanetz.de/niedersachsen/30823/auf_der_horst115.php
Kindertagesstätte Idensen                                    kita.idensen@wunstorf.de            https://www.kitanetz.de/niedersachsen/31515/an_der_sigwardskirche2.php
Kindergroßtagespflege `Nesthäkchen´                          nesthaekchen-isernhagen@gmx.de      https://www.kitanetz.de/niedersachsen/30916/am_rathfeld4.php
Venhof Kindertagesstätte                                     venhof@t-online.de                  https://www.kitanetz.de/niedersachsen/31515/schulstrasse14.php
Kindergarten Uetze `Buddelkiste´                             buddelkiste@uetze.de                https://www.kitanetz.de/niedersachsen/31311/eichendorffstrasse2b.php
Kita Lindenblüte                                             m.herzog@lebenshilfe-dh.de          https://www.kitanetz.de/niedersachsen/27232/lindern17.php
DRK Kita Luthe                                               kita.luthe@drk-hannover.de          https://www.kitanetz.de/niedersachsen/31515/an_der_boehmerke7.php
Freier Kindergarten Allerleirauh                             info@kindergarten-allerleirauh.de   https://www.kitanetz.de/niedersachsen/31303/dachtmisser_weg3.php
Ev.-luth. Kindergarten St. Johannis                          johannis.bs.kita@lk-bs.de           https://www.kitanetz.de/niedersachsen/38102/leonhardstr40.php
Kindertagesstätte Immensen-Arpke I                           kita.immensen@htp-tel.de            https://www.kitanetz.de/niedersachsen/31275/am_schnittgraben15.php
SV Mörsen-Scharrendorf Mini-Club                             svms-mini-club@freenet.de           https://www.kitanetz.de/niedersachsen/27239/am-sportheim6.php
Kindergarten Transvaal                                       kiga-transvaal@awo-emden.de         https://www.kitanetz.de/niedersachsen/26723/althusiusstr89.php
Städtische Kindertagesstätte Gartenstadt                     kita.gartenstadt@braunschweig.de    https://www.kitanetz.de/niedersachsen/38122/wurmbergstr48.php
Kindergruppe Till Eulenspiegel e.V. - Bärenbande & Windelrocker tilleulenspiegel-bs@gmx.de          https://www.kitanetz.de/niedersachsen/38102/kurt-schumacher-str7.php
Ev. luth. Kindertagesstätte der Versöhnun                    kts.versoehnung-garbsen@evlka.de    https://www.kitanetz.de/niedersachsen/30823/im_alten_dorfe6.php
Kinderkrippe Ratzenspatz                                     ratzenspatz@kila-ini.de             https://www.kitanetz.de/niedersachsen/31535/am_goetheplatz5.php
Kinderkrippe Hemmingen-Westerfeld                            krippe-hw@stadthemmingen.de         https://www.kitanetz.de/niedersachsen/30966/berliner_strasse16-22.php

... and so on.
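
Since the question also asks about getting the results into a table, here is a minimal sketch of how the two regular expressions from the code above could be wrapped in small functions and the results collected as a list of dicts. It runs offline: the HTML fragment, the URL with postal code 00000, the title "Example Kita" and the helper names (`is_profile_url`, `extract_email`, `rows_to_csv`) are all made up for illustration, and the real pages may embed the address differently than this sample assumes.

```python
import csv
import io
import re

# Same URL pattern as in the answer: /<postal code>/<page>.php
PROFILE_URL = re.compile(r'/\d+/[^/]+\.php')
# Same e-mail pattern as in the answer: three quoted parts named ez, ey, ex
EMAIL_PARTS = re.compile(r"ez='(.*?)'.*?ey='(.*?)'.*?ex='(.*?)'", flags=re.S)


def is_profile_url(url):
    """Return True if the URL looks like a profile page."""
    return PROFILE_URL.search(url) is not None


def extract_email(html):
    """Rebuild the address from the ez/ey/ex parts, or return None."""
    match = EMAIL_PARTS.search(html)
    if match is None:
        return None
    return '{}@{}.{}'.format(match[1], match[2], match[3])


def rows_to_csv(rows):
    """Serialize a list of row dicts to CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=['title', 'email', 'url'], lineterminator='\n'
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical sample data (not taken from the real site)
sample_html = "<script>var ez='info';\nvar ey='kita-example';\nvar ex='org';</script>"
sample_url = 'https://www.kitanetz.de/niedersachsen/00000/example-str1.php'

rows = []
if is_profile_url(sample_url):
    rows.append({
        'title': 'Example Kita',
        'email': extract_email(sample_html) or '-',
        'url': sample_url,
    })

print(rows_to_csv(rows))
```

In the loop from the answer, the `print(...)` call could be replaced by a `rows.append({...})` like the one here. If pandas is installed, `pandas.DataFrame(rows)` should turn the same list of dicts into a dataframe with the columns title, email and url.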


 