
Python Web Scraping: URL Pagination

A friend and I are working on a web scraper for the Michigan Campaign Finance website. We want to add pagination to this tool but aren't sure how to go about it. Right now the code successfully scrapes and writes to a CSV, but only for the single page specified in the URL (see the link below). Can anyone help us add pagination? I have tried the .format() and for-loop approaches with no luck. My code is below.

https://cfrsearch.nictusa.com/documents/473261/details/filing/contributions?schedule=1A&changes=0&page=1

import requests
import requests_cache
import lxml.html as lh
import pandas as pd
import sqlite3
import matplotlib.pyplot as plt
from bs4 import BeautifulSoup
from urllib.request import urlopen

base_url = 'https://cfrsearch.nictusa.com/documents/473261/details/filing/contributions?schedule=1A&changes=0&page=11'

#requests_cache.install_cache(cache_name='whitmer_donor_cache', backend='sqlite', expire_after=180)

#Scrape Table Cells
page = requests.get(base_url)

doc = lh.fromstring(page.content)
tr_elements = doc.xpath('//tr')
#print([len(T) for T in tr_elements[:12]])

#Parse Table Header
tr_elements = doc.xpath('//tr')
col = []
i = 0
for t in tr_elements[0]:
    i += 1
    name = t.text_content()
    print('%d:"%s"'%(i,name))
    col.append((name,[]))

###Create Pandas Dataframe###
for j in range(1,len(tr_elements)):
    T = tr_elements[j]
    if len(T)!=9:
        break
    i = 0
    for t in T.iterchildren():
        data = t.text_content() 
        if i>0:
            try:
                data = int(data)
            except:
                pass
        col[i][1].append(data)
        i+=1
#print([len(C) for (title,C) in col])

###Format Dataframe###
Dict = {title:column for (title,column) in col}
df = pd.DataFrame(Dict)
df = df.replace('\n','', regex=True)
df = df.replace('  ', ' ', regex=True)
df['Receiving Committee'] = df['Receiving Committee'].apply(lambda x : x.strip().capitalize())

###Print Dataframe###
with pd.option_context('display.max_rows', 10, 'display.max_columns', 10):  # more options can be specified also
    print(df)

df.to_csv('Whitmer_Donors.csv', mode='a', header=False)

#create excel writer
#writer = pd.ExcelWriter("Whitmer_Donors.xlsx")

#write dataframe to excel#
#df.to_excel(writer)
#writer.save()
print("Dataframe is written successfully to excel")

Any recommendations on how to proceed?

You mentioned using .format() but I don't see it anywhere in the code you provided. The URL has a page parameter which you can use with str.format():

# note the braces at the end
base_url = 'https://cfrsearch.nictusa.com/documents/473261/details/filing/contributions?schedule=1A&changes=0&page={}'

for page_num in range(1, 100):
    url = base_url.format(page_num)
    page = requests.get(url)  # use `url` here, not `base_url`
    ...  # rest of your code

Ideally, you'll want to keep increasing page_num without setting an upper limit, and break if you get a 404 or any other error:

page_num = 0
while True:
    page_num += 1
    url = base_url.format(page_num)
    page = requests.get(url)  # use `url` here, not `base_url`
    if 400 <= page.status_code < 600:  # client errors or server errors
        break
    ...  # rest of your code

I highly recommend putting the various parts of your script into reusable functions which you can call with different parameters. Split it into smaller, more manageable pieces for easier use and debugging.
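For example, the same loop could be organized roughly like this (a minimal sketch; scrape_page() is a placeholder standing in for the table-parsing code you already have):

import requests
import lxml.html as lh
import pandas as pd

base_url = 'https://cfrsearch.nictusa.com/documents/473261/details/filing/contributions?schedule=1A&changes=0&page={}'

def fetch_page(page_num):
    # return the parsed document for one page, or None on an HTTP error
    response = requests.get(base_url.format(page_num))
    if 400 <= response.status_code < 600:
        return None
    return lh.fromstring(response.content)

def scrape_page(doc):
    # placeholder: reuse your existing header/row parsing here and return a DataFrame
    ...

def scrape_all():
    frames = []
    page_num = 0
    while True:
        page_num += 1
        doc = fetch_page(page_num)
        if doc is None:
            break
        frames.append(scrape_page(doc))
    return pd.concat(frames)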

I would suggest using the params argument of requests.get, like this:

params = {"schedule": "1A", "changes": '0', "page": "1"}
page = requests.get(base_url, params=params)

It will create the correct URL automatically for you.
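If you want to double-check the URL that requests builds from those params before sending anything, you can prepare the request first and inspect it, e.g.:

import requests

base_url = 'https://cfrsearch.nictusa.com/documents/473261/details/filing/contributions'
params = {"schedule": "1A", "changes": "0", "page": "1"}

# build the request without sending it, then look at the composed URL
prepared = requests.Request('GET', base_url, params=params).prepare()
print(prepared.url)
# https://cfrsearch.nictusa.com/documents/473261/details/filing/contributions?schedule=1A&changes=0&page=1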

Also, in order to get all the pages, you can just loop over them. When you hit an empty dataframe, you can assume all the data has been downloaded and exit the loop. I've implemented a for loop with 41 iterations since I know how many pages there are, but if you don't know, you can set a very high number. If you don't want "magic" numbers in your code, use a while loop instead (see the sketch after the full listing below), but be careful not to fall into an endless one.

I've taken the liberty of reworking your code a bit into a more functional approach. Going forward, you might want to modularize it further.

import requests
import requests_cache
import lxml.html as lh
import pandas as pd
import sqlite3
import matplotlib.pyplot as plt
from bs4 import BeautifulSoup
from urllib.request import urlopen

base_url = 'https://cfrsearch.nictusa.com/documents/473261/details/filing/contributions'

#requests_cache.install_cache(cache_name='whitmer_donor_cache', backend='sqlite', expire_after=180)


def get_page(page_url, params):
    #Scrape Table Cells
    page = requests.get(page_url, params=params)
    #print(page.text)  # debug: dump the raw HTML of each page

    doc = lh.fromstring(page.content)
    tr_elements = doc.xpath('//tr')
    #print([len(T) for T in tr_elements[:12]])

    #Parse Table Header
    tr_elements = doc.xpath('//tr')
    col = []
    i = 0
    for t in tr_elements[0]:
        i += 1
        name = t.text_content()
        print('%d:"%s"' % (i, name))
        col.append((name, []))
    # print(col)

    ###Create Pandas Dataframe###
    for j in range(1, len(tr_elements)):
        T = tr_elements[j]
        if len(T) != 9:
            break
        i = 0
        for t in T.iterchildren():
            data = t.text_content().strip()
            if i > 0:
                try:
                    data = int(data)
                except:
                    pass
            col[i][1].append(data)
            i += 1
        # print(col[0:3])
    #print([len(C) for (title,C) in col])

    ###Format Dataframe###
    Dict = {title: column for (title, column) in col}
    df = pd.DataFrame(Dict)
    df = df.replace('\n', '', regex=True)
    df = df.replace('  ', ' ', regex=True)
    df['Receiving Committee'] = df['Receiving Committee'].apply(
        lambda x: x.strip().capitalize())

    ###Print Dataframe###
    with pd.option_context('display.max_rows', 10, 'display.max_columns',
                           10):  # more options can be specified also
        print(df)

    return df


def get_all_pages(base_url):
    df_list = []
    for i in range(1, 42):
        params = {"schedule": "1A", "changes": '0', "page": str(i)}
        df = get_page(base_url, params)
        print(df)
        if df.empty:
            print("Empty dataframe! All done.")
            break
        df_list.append(df)
        print(df)
        print('====================================')
    return df_list


df_list = get_all_pages(base_url)
pd.concat(df_list).to_csv('Whitmer_Donors.csv', mode='w', header=False)

#create excel writer
#writer = pd.ExcelWriter("Whitmer_Donors.xlsx")

#write dataframe to excel#
#df.to_excel(writer)
#writer.save()
print("Dataframe is written successfully to excel")

Here's a slightly different implementation. Use read_html() to load the table directly into pandas, then use soup to find the next page. If there isn't a next page, the loop exits. The page you are scraping has 40 pages, so start at 38, for example, and it will finish and print a df with 300 rows. Any modifications to the dataframe can be done at the end.

import json
import requests
import pandas as pd
from bs4 import BeautifulSoup

# this function looks for the next page url; returns None if it isn't there
def parse(soup):   
    try:
        return json.loads(soup.find('search-results').get(':pagination'))['next_page_url']
    except:
        return None


start_urls = ['https://cfrsearch.nictusa.com/documents/473261/details/filing/contributions?schedule=1A&changes=0&page=38'] # change to 1 for the full run
df_hold_list = [] # collect your dataframes to concat later

for url in start_urls: # you can iterate through different urls or just the one
    page = requests.get(url)
    soup = BeautifulSoup(page.text, "html.parser")
    df = pd.read_html(page.text)[0]
    df_hold_list.append(df)
    next_url = parse(soup)  # renamed from `next` to avoid shadowing the built-in

    while next_url:
        print(next_url)
        page = requests.get(next_url)
        soup = BeautifulSoup(page.text, "html.parser")
        df = pd.read_html(page.text)[0]  # parse the page just fetched, not the original url
        df_hold_list.append(df)
        next_url = parse(soup)

df_final = pd.concat(df_hold_list)
df_final.shape

(300, 9) # 300 rows, 9 columns
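From there, any cleanup happens once on the combined dataframe and you write it out at the end, for example (assuming the 'Receiving Committee' header from the site's table comes through read_html unchanged):

# tidy the combined dataframe and write it out once
df_final['Receiving Committee'] = df_final['Receiving Committee'].str.strip().str.capitalize()
df_final.to_csv('Whitmer_Donors.csv', index=False)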
