
Web-scraping a list with multiple pages with BeautifulSoup

I am trying to pull all of this year's entries from this page for analysis. I have written a Python script that does exactly that, and it works great. However, it only retrieves the first 10 entries, and I haven't been able to select a date range.

My best guess would have been to put the page number in the URL, but as far as I can tell, the page doesn't take any parameters. I have checked the "next page" buttons in the HTML document and they do have an href link, but following that link always just throws me back to page 1.

Now I'm kind of stuck on this and really hoping someone can point me in the right direction.

Sorry for the vague question; I am new to web scraping, HTML, and Python, so I'm not sure where to even start looking.

This is my current code:

import requests
from bs4 import BeautifulSoup
from collections import namedtuple

# the Antrag class is not shown in the question; a namedtuple works as a minimal stand-in
Antrag = namedtuple('Antrag', ['titel', 'beschlossen', 'typ', 'status', 'ba', 'referat'])

page = requests.get("https://risi.muenchen.de/risi/antrag/ba/baantraguebersicht?2")
soup = BeautifulSoup(page.content, 'html.parser')

uls = soup.find_all('ul')
antraege = uls[17].find_all('li')  # the 18th <ul> on the page holds the list entries
atrgs = list()

for antrag in antraege:
    titel = antrag.find_all('a', class_='headline-link')[0].string.strip()
    beschlossen = antrag.find_all('div', class_='keyvalue-value')[0].string.strip()
    typ = antrag.find_all('div', class_='keyvalue-value')[1].string.strip()
    status = antrag.find_all('div', class_='keyvalue-value')[4].string.strip()
    ba = antrag.find_all('a', class_='icon_action')[0].string.strip()
    referat = antrag.find_all('a', class_='icon_action')[1].string.strip()
    atrg = Antrag(titel, beschlossen, typ, status, ba, referat)
    atrgs.append(atrg)

To get all pages, you can use the following example:

import requests
from bs4 import BeautifulSoup

headers = {
    "Wicket-Ajax": "true",
    "Wicket-Ajax-BaseURL": "antrag/ba/baantraguebersicht?0",
}


def get_info(soup):
    rv = []
    for lg in soup.select("li.list-group-item"):
        title = " ".join(lg.select_one(".headline-link").text.split("\r\n"))
        # value next to the "Beschlossen am:" key (the decision date)
        ba = lg.select_one(
            '.keyvalue-key:-soup-contains("Beschlossen am:") + div'
        ).text.strip()
        # get other info here
        # ...
        rv.append((title, ba))
    return rv


with requests.Session() as s:
    # the first page:
    page = s.get("https://risi.muenchen.de/risi/antrag/ba/baantraguebersicht?0")
    soup = BeautifulSoup(page.content, "html.parser")

    counter = 1
    while True:
        
        for title, ba in get_info(soup):
            print(counter, title, ba)
            counter += 1

        # is there next page?
        tag = soup.select_one('[title="Eine Seite vorwärts gehen"]')

        if not tag:
            # no, we are done here:
            break

        headers["Wicket-FocusedElementId"] = tag["id"]

        page = s.get(
            "https://risi.muenchen.de/risi/antrag/ba/baantraguebersicht?0-1.0-color_container-list-cardheader-nav_top-next",
            headers=headers,
        )
        soup = BeautifulSoup(page.content, "xml")
        soup = BeautifulSoup(
            soup.select_one("ajax-response").text, "html.parser"
        )

Prints:

1 Auskunft über geplante Wohnbebauung westlich der Drygalski-Allee 02.08.2022
2 Bestellung einer städtischen Leistung: Finanzierung von Ferien- und Familienpässen für Einrichtun... 02.08.2022
3 Verzögerungen bei der Verlegung von Glasfaserkabeln im 19. Stadtbezirk 02.08.2022
4 Virtuelle Tagungsmöglichkeiten für Unterausschüsse weiter ermöglichen 02.08.2022
5 Offene Fragen zur Schließung des Maria-Einsiedel-Bades 02.08.2022

...

139 Bestellung einer städtischen Leistung, hier: Topo-Box-Einsatz in der Schröfelhofstraße / Ossinger... 11.07.2022
140 Bestellung einer städtischen Leistung, hier: Topo-Box-Einsatz in der Pfingstrosenstraße 11.07.2022
141 Bestellung einer städtischen Leistung; hier: Topo-Box-Einsatz in der Alpenveilchenstraße 11.07.2022

EDIT: To select a "Wahlperiode" (electoral period):

import requests
from bs4 import BeautifulSoup

headers = {
    "Wicket-Ajax": "true",
    "Wicket-Ajax-BaseURL": "antrag/ba/baantraguebersicht?0",
}


def get_info(soup):
    rv = []
    for lg in soup.select("li.list-group-item"):
        title = " ".join(lg.select_one(".headline-link").text.split("\r\n"))
        ba = lg.select_one(
            '.keyvalue-key:-soup-contains("Beschlossen am:") + div'
        ).text.strip()
        # get other info here
        # ...
        rv.append((title, ba))
    return rv


with requests.Session() as s:
    # the first page:
    page = s.get("https://risi.muenchen.de/risi/antrag/ba/baantraguebersicht?0")
    soup = BeautifulSoup(page.content, "html.parser")

    # select the Wahlperiode "01.05.2008 bis 30.04.2014"
    tag = soup.select_one(
        '.dropdown-item[title="Belegt die Datumsfelder mit dem Datumsbereich von 01.05.2008 bis 30.04.2014"]'
    )

    headers["Wicket-FocusedElementId"] = tag["id"]
    
    # for a different Wahlperiode, change the `periodenEintrag-2` part
    page = s.get(
        "https://risi.muenchen.de/risi/antrag/ba/baantraguebersicht?0-1.0-form-periodeButton-periodenEintrag-2-periode=&_=1660125170317",
        headers=headers,
    )

    # reload first page with new data:
    page = s.get("https://risi.muenchen.de/risi/antrag/ba/baantraguebersicht?0")
    soup = BeautifulSoup(page.content, "html.parser")

    counter = 1
    while True:

        for title, ba in get_info(soup):
            print(counter, title, ba)
            counter += 1

        # is there next page?
        tag = soup.select_one('[title="Eine Seite vorwärts gehen"]')

        if not tag:
            # no, we are done here:
            break

        headers["Wicket-FocusedElementId"] = tag["id"]

        page = s.get(
            "https://risi.muenchen.de/risi/antrag/ba/baantraguebersicht?0-2.0-color_container-list-cardheader-nav_top-next",
            headers=headers,
        )
        soup = BeautifulSoup(page.content, "xml")
        soup = BeautifulSoup(
            soup.select_one("ajax-response").text, "html.parser"
        )

Prints:


...

1836 Willkommen auf der Wärmeinsel München - begünstigt durch Nachverdichtung und Versiegelung in den ... 24.05.2012
1837 Kontingentplätze in Kitas der freien Träger 24.05.2012
1838 Brachliegende Grundstücke in der Messestadt 24.05.2012
1839 Mut zur nachhaltigen Gestaltung - externe kleinteilige B-Pläne  zulassen 24.05.2012
1840 Grundstücksverkauf 4. Bauabschnitt Wohnen in der Messestadt 24.05.2012

...
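
For the analysis step, it may be more convenient to collect the tuples into a list inside the paging loop and write them to a CSV file instead of printing them. A small sketch using Python's standard csv module; the function name save_rows and the file name antraege.csv are just placeholders:

import csv

def save_rows(rows, path="antraege.csv"):
    # rows: list of (title, date_string) tuples collected from get_info()
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["Titel", "Beschlossen am"])
        writer.writerows(rows)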
