
Press Button with POST request and scraping data from popup in Python

I would like to press the "Suche starten" button on the page below and scrape the results for a research project. The button can be pressed without filling in any of the form fields; a popup then opens that holds the data I want.

https://www.insolvenzbekanntmachungen.de/cgi-bin/bl_suche.pl

The site is the German public register of announcements for companies that go bankrupt. I have already spent considerable time on this, but I can't get it to work. I know I could use Selenium with a headless browser, but I'd prefer the cleaner requests solution, and I'd like to be able to run the script continuously on a server with little effort and without a screen.

So far I have inspected the POST request my browser sends using the Firefox Dev Tools and tried to emulate it. The problem is that I only get the standard content of the initial page, not the content of the popup window, which holds all the data I want.

So I imported the requests library and built a request with custom headers and a payload.

headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:60.0) Gecko/20100101 Firefox/60.0',
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Encoding": "gzip, deflate, br",
    "Accept-Language": "en-GB,en;q=0.5",
    "Cache-Control": "no-cache",
    "Connection": "keep-alive",
    "Content-Length": "413",
    "Content-Type": "application/x-www-form-urlencoded",
    "Host": "www.insolvenzbekanntmachungen.de",
    "Pragma": "no-cache",
    "Referer": "https://www.insolvenzbekanntmachungen.de/cgi-bin/bl_suche.pl",
    "Upgrade-Insecure-Requests": "1"
}

payload = {
    'Suchfunktion': 'uneingeschr',
    'Absenden': 'Suche+starten',
    'Bundesland': '-Hamburg',
    'Gericht': 'Hamburg',
    'Datum1': '',
    'Datum2': '',
    'Name': '',
    'Sitz': '',
    'Abteilungsnr': '',
    'Registerzeichen': '--',
    'Lfdnr': '',
    'Jahreszahl': '--',
    'Registerart': '--+keine+Angabe+--',
    'select_registergericht': '',
    'Registergericht': '--+keine+Angabe+--',
    'Registernummer': '',
    'Gegenstand': '--+Alle+Bekanntmachungen+innerhalb+des+Verfahrens+--',
    'matchesperpage': '10',
    'page': '1',
    'sortedby': 'Datum',
    'submit': 'return validate_globe(this)',
}

Then I make the following request:

r = requests.post('https://www.insolvenzbekanntmachungen.de/cgi-bin/bl_suche.pl',headers=headers,data=payload)

Unfortunately, print(r.text) does not give me the data from the popup that appears in a browser.

Any help would be greatly appreciated!

Jasper
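
One likely contributor to the problem, offered as an assumption rather than a confirmed diagnosis: the dict in the question holds values copied from the browser's raw request body, where spaces already appear as "+". When a dict is passed as data=, requests form-encodes each value again (much like urllib.parse.urlencode), so "+" is escaped to "%2B" and the server sees a literal plus sign. The sketch below uses only the standard library to show the difference offline. It also shows that the "%E4" in the working payload string corresponds to "ä" encoded as Latin-1, not UTF-8; the variable names are illustrative only.

```python
from urllib.parse import urlencode

# Value copied from the browser's raw request body, "+" included.
raw_copy = {"Absenden": "Suche+starten"}
# The same field written with a plain space, as a user would see it.
decoded = {"Absenden": "Suche starten"}

# Encoding the raw copy escapes the "+" itself, so it is encoded twice.
double_encoded = urlencode(raw_copy)
# Encoding the decoded value turns the space into "+", matching the browser.
single_encoded = urlencode(decoded)

# "%E4" is the Latin-1 byte for "ä"; the default UTF-8 would give "%C3%A4".
latin1_field = urlencode(
    {"Bundesland": "-- Alle Bundesländer --"}, encoding="latin-1"
)

print(double_encoded)
print(single_encoded)
print(latin1_field)
```

If this is indeed the cause, either write the dict values with real spaces or send the body as an already-encoded string. The hard-coded Content-Length header is also best left out, since requests computes it from the actual body.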

A quick and easy fix would be something like the code below, which sends the form body as an already URL-encoded string. Give it a go:

import requests
from bs4 import BeautifulSoup

URL = 'https://www.insolvenzbekanntmachungen.de/cgi-bin/bl_suche.pl'
payload = 'Suchfunktion=uneingeschr&Absenden=Suche+starten&Bundesland=--+Alle+Bundesl%E4nder+--&Gericht=--+Alle+Insolvenzgerichte+--&Datum1=&Datum2=&Name=&Sitz=&Abteilungsnr=&Registerzeichen=--&Lfdnr=&Jahreszahl=--&Registerart=--+keine+Angabe+--&select_registergericht=&Registergericht=--+keine+Angabe+--&Registernummer=&Gegenstand=--+Alle+Bekanntmachungen+innerhalb+des+Verfahrens+--&matchesperpage=10&page=1&sortedby=Datum'

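with requests.Session() as s:
    # Update the session's default headers rather than replacing them with a plain dict.
    s.headers.update({
        "User-Agent": "Mozilla/5.0",
        "Content-Type": "application/x-www-form-urlencoded",
    })
    # The payload is already URL-encoded, so it is passed as a string, not a dict.
    res = s.post(URL, data=payload)
    soup = BeautifulSoup(res.text, "lxml")
    # Select <a> elements nested in <li> inside <b>; these hold the result rows.
    for item in soup.select("b li a"):
        print(item.get_text(strip=True))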

Output:

2018-07-05A & A Eco Clean Gebäudereinigung GmbH, München, 1503 IN 1836/16, Registergericht München, HRB 189121
2018-07-05A & A Eco Clean Gebäudereinigung GmbH, München, 1503 IN 1836/16, Registergericht München, HRB 189121
2018-07-05A + S Wohnungsbau Besitz GmbH & Co.KG, Kandel, 3 IN 96/12, Registergericht Landau in der Pfalz, HRA 21214
2018-07-05Abb Nicola, Untersöchering, IN 462/11
2018-07-05Abb Nicola, Untersöchering, IN 462/11
2018-07-05Abdul Basit Qureshi, Kirchheim, 13 IN 23/17
2018-07-05Abdul Basit Qureshi, Kirchheim, 13 IN 23/17
2018-07-05Abdul Basit Qureshi, Kirchheim, 13 IN 23/17
2018-07-05Abdulrahman, Oulat, Bottrop, 162 IN 76/12
2018-07-05Abdurachid Hassan, München, 1500 IK 2170/17
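
In the output above, the date runs straight into the name because get_text(strip=True) joins the text pieces without a separator. A minimal sketch of one way to split each row afterwards, assuming every row starts with a date in YYYY-MM-DD form as in the sample; Announcement and split_row are hypothetical names introduced here, not part of the original answer.

```python
import re
from typing import NamedTuple, Optional


class Announcement(NamedTuple):
    date: str     # publication date as printed, e.g. "2018-07-05"
    details: str  # remainder of the row: name, seat, file number, register


# A leading ISO-style date followed by the rest of the row.
_ROW = re.compile(r"^(\d{4}-\d{2}-\d{2})\s*(.+)$")


def split_row(text: str) -> Optional[Announcement]:
    """Split a scraped row into date and details, or return None if no date leads it."""
    match = _ROW.match(text.strip())
    if match is None:
        return None
    return Announcement(match.group(1), match.group(2).strip())


row = split_row("2018-07-05Abb Nicola, Untersöchering, IN 462/11")
print(row)
```

Rows that do not start with a date return None, so they can be logged or skipped rather than parsed incorrectly.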


 