
Function to web scrape tables from several pages

I am learning Python and I am trying to create a function to web scrape tables of vaccination rates from several pages of a GitHub repository for Our World in Data (https://github.com/owid/covid-19-data/tree/master/public/data/vaccinations/country_data ; see also https://ourworldindata.org/about ). The code works perfectly when scraping a single table and saving it into a data frame...

import requests
from bs4 import BeautifulSoup
import pandas as pd

url = "https://github.com/owid/covid-19-data/blob/master/public/data/vaccinations/country_data/Bangladesh.csv"
response = requests.get(url)
response

# Parse the GitHub file page and pull out the rendered CSV table
scraping_html_table_BD = BeautifulSoup(response.content, "lxml")
scraping_html_table_BD = scraping_html_table_BD.find_all("table", "js-csv-data csv-data js-file-line-container")

# read_html returns a list of DataFrames; the first one is the table we want
df = pd.read_html(str(scraping_html_table_BD))
BD_df = df[0]

But I have not had much luck when trying to create a function to scrape several pages. I have been following the tutorial on this website 3, in the section 'Scrape multiple pages with one script', and Stack Overflow questions like 4 and 5, amongst other pages. I tried creating a global variable first, but I end up with errors like "RecursionError: maximum recursion depth exceeded while calling a Python object". This is the best code I have managed, as it doesn't generate an error, but I've not managed to save the output to a global variable. I really appreciate your help.

import pandas as pd  
from bs4 import BeautifulSoup
import requests

link_list = ['/Bangladesh.csv',
             '/Nepal.csv',
             '/Mongolia.csv']

def get_info(page_url):
    # Fetch the GitHub page for one country and pull out the rendered CSV table
    page = requests.get('https://github.com/owid/covid-19-data/tree/master/public/data/vaccinations/country_data' + page_url)
    scape = BeautifulSoup(page.text, 'html.parser')
    vaccination_rates = scape.find_all("table", "js-csv-data csv-data js-file-line-container")

    df = pd.read_html(str(vaccination_rates))
    vaccination_rates = df[0]
    df = pd.DataFrame(vaccination_rates)
    print(df)
    df.to_csv("testdata.csv", index=False)


for link in link_list:
    get_info(link)

Edit: I can view the data from the final webpage in the iteration, since it is saved to a CSV file, but not the data from the preceding links.

new = pd.read_csv('testdata6.csv')
pd.set_option("display.max_rows", None, "display.max_columns", None)
new

This is because on every iteration your 'testdata.csv' is overwritten with a new one. So you can write each country to its own file instead: df.to_csv(page_url[1:], index=False)
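
For instance, keeping the imports, link_list, and function body from the question, only the last line changes (a minimal sketch; page_url[1:] simply strips the leading '/', so '/Bangladesh.csv' becomes 'Bangladesh.csv'):

def get_info(page_url):
    page = requests.get('https://github.com/owid/covid-19-data/tree/master/public/data/vaccinations/country_data' + page_url)
    scape = BeautifulSoup(page.text, 'html.parser')
    vaccination_rates = scape.find_all("table", "js-csv-data csv-data js-file-line-container")
    df = pd.read_html(str(vaccination_rates))[0]
    # Name the output after the link, so each country gets its own file
    df.to_csv(page_url[1:], index=False)

for link in link_list:
    get_info(link)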

I'm guessing you're overwriting your 'testdata.csv' each time, which is why you can only see the final page. I would either use enumerate to add an identifier so you write a separate CSV each time you scrape a page, e.g.:

def get_info(page_url, key):
    ...
    df.to_csv(f"testdata{key}.csv", index=False)

for key, link in enumerate(link_list):
    get_info(link, key)

Or, open the CSV in append mode as part of your get_info function; the steps are covered in append new row to old csv file python.
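
A minimal sketch of that approach, assuming you want every country appended to one testdata.csv (the body of get_info stays as in the question; 'first' is a hypothetical flag so the header row is only written once — mode='a' and header= are standard arguments of pandas DataFrame.to_csv):

def get_info(page_url, first):
    ...
    # Append to one shared file; only the first call writes the header row
    df.to_csv("testdata.csv", mode='a', header=first, index=False)

for i, link in enumerate(link_list):
    get_info(link, i == 0)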
