
Can't get the HTML source of a JavaScript-generated page with Selenium

I have this page ( https://www.ssn.gob.ar/storage/registros/productores/productoresactivosfiltro.asp ) that I want to extract data from.

You can look up a person's data just by entering a number in the "Matricula" field. That part is easy, but when the site generates the results page and I try to read the data from a specific div, I get None. When I check the HTML that Selenium is working with, it is the same as the page where I typed the number, not the results page.

import os
import time
import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

def clear(): return os.system("cls")

options =  webdriver.ChromeOptions()
options.add_argument('--start-maximized')
options.add_argument('--disable-extensions')

driver_path = 'C:\\Users\\Menem Lo Hizo\\Downloads\\chromedriver_win32\\chromedriver.exe'

driver = webdriver.Chrome(driver_path, chrome_options=options)

driver.get('https://www.ssn.gob.ar/storage/registros/productores/productoresactivosfiltro.asp')

matricula = driver.find_element_by_id("matricula")

matricula.send_keys("2")
matricula.send_keys(Keys.RETURN)

try:
    div = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.CLASS_NAME, "col-md-8 col-md-offset-2"))
    )
except:
    driver.quit()

clear()
print(div)

This is my code.

A few things:

  1. You need explicit waits.

  2. When you hit Enter on the first page, a new tab opens, and you need to switch to that window.

Code:

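# Reuses `driver` and the imports (By, Keys, WebDriverWait, EC) from the question.
driver.get("https://www.ssn.gob.ar/storage/registros/productores/productoresactivosfiltro.asp")
wait = WebDriverWait(driver, 10)
org_handles = driver.window_handles
wait.until(EC.element_to_be_clickable((By.ID, "matricula"))).send_keys("2" + Keys.RETURN)
# Wait until the results tab exists, then switch to the handle that was not there before.
wait.until(EC.new_window_is_opened(org_handles))
new_handle = [h for h in driver.window_handles if h not in org_handles][0]
driver.switch_to.window(new_handle)
# Two classes on one element need a CSS selector; By.CLASS_NAME takes a single class name.
div = wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, ".col-md-8.col-md-offset-2")))
print(div.text)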
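
A side note on the locator, since the code above switches from By.CLASS_NAME to By.CSS_SELECTOR: By.CLASS_NAME expects a single class name, so the space-separated value "col-md-8 col-md-offset-2" from the question does not identify the div. A minimal sketch of the conversion follows; the helper name classes_to_css is made up for illustration and is not part of Selenium.

```python
def classes_to_css(class_attr: str) -> str:
    """Turn a space-separated class attribute into a compound CSS selector.

    Hypothetical helper for illustration only, not a Selenium API.
    """
    # Each class name gets a leading dot; joining without spaces means
    # "one element that carries all of these classes".
    return "".join("." + name for name in class_attr.split())


selector = classes_to_css("col-md-8 col-md-offset-2")
print(selector)  # .col-md-8.col-md-offset-2
```

The resulting string is what you would pass as the second item of the (By.CSS_SELECTOR, ...) tuple.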

Logging one's network traffic while submitting the form reveals an HTTP POST request to productoresactivos.asp , whose response is HTML. Simply imitate that request:

def get_columns():
    import requests
    from bs4 import BeautifulSoup as Soup

    url = "https://www.ssn.gob.ar/storage/registros/productores/productoresactivos.asp"

    payload = {
        "socpro": "PAS",
        "matricula": "2",
        "apellidorazonsocial": "",
        "docNro": "",
        "Submit": "Buscar"
    }

    response = requests.post(url, data=payload)
    response.raise_for_status()

    soup = Soup(response.content, "html.parser")

    for column in soup.select("div[class^=\"col-md-\"]"):
        yield " ".join(column.get_text().strip().split())


def main():
    for text in get_columns():
        print(text)
    return 0


if __name__ == "__main__":
    import sys
    sys.exit(main())

Output:

Página 1 de 1
Matrícula: 2
Nombre: CABELLO DE GADANO, MARIA CRISTINA
Documento: DNI - 5263977
CUIT: 27-05263977-3
Ramo: PATRIMONIALES Y VIDA
Domicilio: AV. CORDOBA 669 12º B
Localidad: CIUDAD AUTONOMA BS.AS.
Provincia CIUDAD AUTONOMA
Cod. Postal: 1054
Teléfonos: 4311-5860
E-mail:
Nro. de Resolución 17053
Fº de Resolución 06/01/1983
Nro. de Libro: 01
Nro. de Rubrica: 20395
Fº. de Rubrica: 21/08/1992
Nro. de Libro: 1
Fº. de Rubrica: 20396
Fº. de Rubrica: 21/08/1992
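
If you want to query more than one matricula and keep the result as data rather than printed text, something like the following might be a starting point. It is only a sketch: build_payload and split_label are hypothetical helper names, the form field names are copied from the payload above, and it assumes the other fields can stay unchanged when only the matricula varies. Pairs are kept in a list rather than a dict because some labels (for example "Nro. de Libro") appear more than once in the output.

```python
def build_payload(matricula):
    """Form data for one lookup; field names are taken from the request above."""
    return {
        "socpro": "PAS",
        "matricula": str(matricula),
        "apellidorazonsocial": "",
        "docNro": "",
        "Submit": "Buscar",
    }


def split_label(line):
    """Split 'Label: value' into (label, value).

    Lines without a colon are returned whole, with None as the value,
    so nothing from the page is silently dropped.
    """
    label, sep, value = line.partition(":")
    if not sep:
        return line.strip(), None
    return label.strip(), value.strip()


# A few lines taken from the output above.
sample = ["Matrícula: 2", "E-mail:", "Provincia CIUDAD AUTONOMA"]
pairs = [split_label(line) for line in sample]
```

You would pass build_payload(n) as the data argument of requests.post and run split_label over each string yielded by get_columns().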
