Can't find table tag in HTML when web scraping with Python BeautifulSoup, and how to scrape a table with many pages

I have two separate questions:


Question 1

I'm trying to scrape some tables from this website: https://transparencia.registrocivil.org.br/registros.

So far, I have written this code:

from bs4 import BeautifulSoup
import requests


url = 'https://transparencia.registrocivil.org.br/registros'

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36'}

source = requests.get(url, headers=headers).text

soup = BeautifulSoup(source, 'html.parser')

table = soup.find('table')

print(table.prettify())

This code isn't working: soup.find('table') returns None, so it seems that BeautifulSoup can't find the table. What am I doing wrong when scraping it?

Once this is done, I'll explain the second part of my question:


Question 2

My main idea is to scrape data using the page's dropdown selectors, iterating over each year, month, region, and state to get city-level data.

Some of those tables are large and are split across multiple pages, as you can see at the bottom of some tables on the website. How could I iterate over all of those pages to gather the data together for each year, month, region, and state?

I believe the data is loaded dynamically, so I would suggest using Selenium to scrape it. BeautifulSoup only parses the HTML it is given; it cannot execute the JavaScript that renders content on the page.

from selenium import webdriver
from selenium.webdriver.common.by import By

browser = webdriver.Chrome()
browser.get("https://transparencia.registrocivil.org.br/registros")

# Locate the rendered table and print the text of each row
# (the find_element_by_css_selector methods were removed in Selenium 4, so use the By API)
table = browser.find_element(By.CSS_SELECTOR, "table")
for row in table.find_elements(By.CSS_SELECTOR, "tr"):
    print(row.text)
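Note that on dynamically rendered pages the table may not exist yet when find_element runs, which raises NoSuchElementException. Here is a minimal sketch using an explicit wait, assuming the page eventually renders a table element (the 10-second timeout is an arbitrary choice):

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

browser = webdriver.Chrome()
browser.get("https://transparencia.registrocivil.org.br/registros")

# Wait up to 10 seconds for JavaScript to render at least one table row
rows = WebDriverWait(browser, 10).until(
    EC.presence_of_all_elements_located((By.CSS_SELECTOR, "table tr"))
)
for row in rows:
    print(row.text)

browser.quit()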

You can get the data you want without BeautifulSoup or Selenium.

If you open Google Chrome's developer tools, record the network traffic, and filter the log to show only XHR resources, you will see that your browser makes requests to a web API, whose responses are JSON containing all the data you could ever want.

Looking more closely at the requests, the API only accepts them if the request headers contain a valid User-Agent field and an XSRF token, which is just a cookie (see the session-based sketch after the list below).

So, you have to:

  1. Determine which resource installs the XSRF token cookie in your browser's session via Set-Cookie in the response headers.
  2. Make a request to that resource and get the XSRF token.
  3. Make a request to another resource which contains a list of all cities in all states. Use this information to generate a set of all states.
  4. Iterate through all states. For each state, make a request to another resource which contains the total record count (Births, Marriages and Deaths) for each city in that state.
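As an aside for steps 1 and 2: a requests.Session stores cookies from Set-Cookie automatically, so you can also read the token from the session's cookie jar rather than parsing the header by hand. A minimal sketch, assuming the landing page sets a cookie named XSRF-TOKEN:

import requests

session = requests.Session()
session.headers["User-Agent"] = "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36"

# The landing page sets the XSRF-TOKEN cookie via Set-Cookie;
# the session keeps it in its cookie jar automatically
response = session.get("https://transparencia.registrocivil.org.br/registros")
response.raise_for_status()

# The cookie value is URL-encoded, so decode it before sending it back as a header
session.headers["X-XSRF-TOKEN"] = requests.utils.unquote(session.cookies["XSRF-TOKEN"])

The full solution below parses the Set-Cookie header directly instead.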

Code:

user_agent = "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36"

def get_cookie():

    import requests
    import re

    url = "https://transparencia.registrocivil.org.br/registros"

    headers = {
        "User-Agent": user_agent
    }

    response = requests.get(url, headers=headers)
    response.raise_for_status()

    # Extract the token value (capture group 1) from the Set-Cookie header and URL-decode it
    return requests.utils.unquote(re.match("XSRF-TOKEN=([^;]+)", response.headers["Set-Cookie"]).group(1))

def get_states(cookie):

    import requests

    url = "https://transparencia.registrocivil.org.br/api/cities"

    headers = {
        "User-Agent": user_agent,
        "X-XSRF-TOKEN": cookie
    }

    response = requests.get(url, headers=headers)
    response.raise_for_status()

    # "uf" is the two-letter state abbreviation (unidade federativa)
    return set(city["uf"] for city in response.json()["cities"])

def get_next_state_results(cookie):

    import requests

    url = "https://transparencia.registrocivil.org.br/api/record/filter-all"

    headers = {
        "User-Agent": user_agent,
        "X-XSRF-TOKEN": cookie
    }

    for state in get_states(cookie):

        params = {
            "start_date": "2020-01-01",
            "end_date": "2020-12-31",
            "state": state
        }

        response = requests.get(url, params=params, headers=headers)
        response.raise_for_status()

        for item in response.json()["data"]:
            yield item

def main():

    cookie = get_cookie()

    for result in get_next_state_results(cookie):
        print(f"{result['name']}: {result['total']}")

    return 0

if __name__ == "__main__":
    import sys
    sys.exit(main())

You can change the month and year by modifying the start_date and end_date query string parameters in the params dict inside the get_next_state_results generator.
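If you want to cover every month of a year rather than hard-coding one date range, you could generate the ranges with the standard library. A small sketch with a hypothetical helper, month_ranges (calendar.monthrange returns the weekday of a month's first day and the number of days in that month):

import calendar

def month_ranges(year):
    """Yield (start_date, end_date) strings, one pair per month of the given year."""
    for month in range(1, 13):
        # monthrange returns (weekday_of_first_day, number_of_days_in_month)
        last_day = calendar.monthrange(year, month)[1]
        yield f"{year}-{month:02d}-01", f"{year}-{month:02d}-{last_day:02d}"

for start_date, end_date in month_ranges(2020):
    print(start_date, end_date)

Each pair could then be passed into the params dict in place of the fixed 2020 dates.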

The full output of the script above is very long, so here are just the first few lines:

Abaré: 137
Acajutiba: 51
Aiquara: 31
Alagoinhas: 1153
Alcobaça: 174
Almadina: 23
Amargosa: 184
Amelia Rodrigues: 171
América Dourada: 120
Anagé: 78
Andaraí: 122
Andorinha: 53
Angical: 65
Anguera: 45
Antas: 106
Antônio Gonçalves: 70
Araças: 82
Aracatu: 95
Araci: 293
Aramari: 39
Aratuípe: 37
Aurelino Leal: 88
Baianópolis: 101
Baixa Grande: 126
Barra: 306
Barra da Estiva: 352
Barra do Choça: 235
Barra do Mendes: 81
Barra do Rocha: 18
Barreiras: 1902
Barro Alto: 86
Barro Preto: 45
Belmonte: 109
Belo Campo: 105
Boa Nova: 83
