
Python Web Scraping: URL Pagination

A friend and I are working on this web scraper for the Michigan Campaign Finance website. We want to achieve pagination on this tool but are not sure how to go about it. Right now the code successfully scrapes and writes to a csv, but only does so for the specified page in the url (see the link below). Can anyone help us achieve pagination on this tool? I have tried the .format() and for loop methods with no luck. My code is below.

https://cfrsearch.nictusa.com/documents/473261/details/filing/contributions?schedule=1A&changes=0&page=1

import requests
import requests_cache
import lxml.html as lh
import pandas as pd
import sqlite3
import matplotlib.pyplot as plt
from bs4 import BeautifulSoup
from urllib.request import urlopen

base_url = 'https://cfrsearch.nictusa.com/documents/473261/details/filing/contributions?schedule=1A&changes=0&page=11'

#requests_cache.install_cache(cache_name='whitmer_donor_cache', backend='sqlite', expire_after=180)

#Scrape Table Cells
page = requests.get(base_url)

doc = lh.fromstring(page.content)
tr_elements = doc.xpath('//tr')
#print([len(T) for T in tr_elements[:12]])

#Parse Table Header
tr_elements = doc.xpath('//tr')
col = []
i = 0
for t in tr_elements[0]:
    i += 1
    name = t.text_content()
    print('%d:"%s"'%(i,name))
    col.append((name,[]))

###Create Pandas Dataframe###
for j in range(1,len(tr_elements)):
    T = tr_elements[j]
    if len(T)!=9:
        break
    i = 0
    for t in T.iterchildren():
        data = t.text_content() 
        if i>0:
            try:
                data = int(data)
            except:
                pass
        col[i][1].append(data)
        i+=1
#print([len(C) for (title,C) in col])

###Format Dataframe###
Dict = {title:column for (title,column) in col}
df = pd.DataFrame(Dict)
df = df.replace('\n','', regex=True)
df = df.replace('  ', ' ', regex=True)
df['Receiving Committee'] = df['Receiving Committee'].apply(lambda x : x.strip().capitalize())

###Print Dataframe###
with pd.option_context('display.max_rows', 10, 'display.max_columns', 10):  # more options can be specified also
    print(df)

df.to_csv('Whitmer_Donors.csv', mode='a', header=False)

#create excel writer
#writer = pd.ExcelWriter("Whitmer_Donors.xlsx")

#write dataframe to excel#
#df.to_excel(writer)
#writer.save()
print("Dataframe is written successfully to excel")

Any recommendations on how to proceed?

You mentioned using .format() but I don't see that anywhere in the code you provided. The URL given has a page parameter which you can use with str.format():

# note the braces at the end
base_url = 'https://cfrsearch.nictusa.com/documents/473261/details/filing/contributions?schedule=1A&changes=0&page={}'

for page_num in range(1, 100):
    url = base_url.format(page_num)
    page = requests.get(url)  # use `url` here, not `base_url`
    ...  # rest of your code

Ideally, you'll want to keep increasing page_num without setting an upper limit and break if you get a 404 result or any other error.

page_num = 0
while True:
    page_num += 1
    url = base_url.format(page_num)
    page = requests.get(url)  # use `url` here, not `base_url`
    if 400 <= page.status_code < 600:  # client errors or server errors
        break
    ...  # rest of your code

I highly recommend that you put the various parts of your script into reusable functions which you can call with different parameters. Split it into smaller, more manageable pieces for ease of use and debugging.
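For instance, here is a minimal sketch of one way it could be broken up. The names fetch_page and scrape_to_df are only illustrative (they are not from your original code), and the parsing assumes the same //tr table layout your script already relies on:

import requests
import lxml.html as lh
import pandas as pd

base_url = ('https://cfrsearch.nictusa.com/documents/473261/details/filing/'
            'contributions?schedule=1A&changes=0&page={}')

def fetch_page(page_num):
    """Fetch one page of results; return None on any HTTP error status."""
    response = requests.get(base_url.format(page_num))
    if 400 <= response.status_code < 600:
        return None
    return response.content

def scrape_to_df(html):
    """Parse the table rows on one page into a DataFrame."""
    doc = lh.fromstring(html)
    rows = [[cell.text_content().strip() for cell in tr] for tr in doc.xpath('//tr')]
    if not rows:
        return pd.DataFrame()
    header = rows[0]
    body = [r for r in rows[1:] if len(r) == len(header)]
    return pd.DataFrame(body, columns=header)

frames = []
page_num = 1
while True:
    html = fetch_page(page_num)
    if html is None:
        break
    df = scrape_to_df(html)
    if df.empty:  # an out-of-range page may still return 200 with no rows
        break
    frames.append(df)
    page_num += 1

if frames:
    pd.concat(frames).to_csv('Whitmer_Donors.csv', index=False)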

I would suggest using the params argument of requests.get, like this:

params = {"schedule": "1A", "changes": '0', "page": "1"}
page = requests.get(base_url, params=params)

It will create the correct URL for you automatically.
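You can verify this by inspecting page.url (the final URL requests actually sent), which should match the link in the question:

import requests

base_url = 'https://cfrsearch.nictusa.com/documents/473261/details/filing/contributions'
params = {"schedule": "1A", "changes": "0", "page": "1"}

page = requests.get(base_url, params=params)
print(page.url)  # the full URL requests built from base_url + params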

Also, in order to get all the pages you can just loop over them. When you hit an empty dataframe, you assume that all the data has been downloaded and you exit the loop. I've implemented a for loop with 41 iterations since I know how many pages there are, but if you don't know, you can set a very high number. In case you don't want "magic" numbers in your code, just use a while loop. But be careful not to run into an endless one...
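A rough sketch of that while-loop variant, assuming the get_page() helper defined in the full script below and relying on an empty dataframe to signal the last page, just like the for-loop version:

df_list = []
page_num = 1
while True:
    params = {"schedule": "1A", "changes": "0", "page": str(page_num)}
    df = get_page(base_url, params)  # helper defined in the full script below
    if df.empty:                     # no rows came back -> we've gone past the last page
        break
    df_list.append(df)
    page_num += 1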

I've taken the liberty of altering your code a bit into a more functional approach. Going forward, you might want to modularize it further.

import requests
import requests_cache
import lxml.html as lh
import pandas as pd
import sqlite3
import matplotlib.pyplot as plt
from bs4 import BeautifulSoup
from urllib.request import urlopen

base_url = 'https://cfrsearch.nictusa.com/documents/473261/details/filing/contributions'

#requests_cache.install_cache(cache_name='whitmer_donor_cache', backend='sqlite', expire_after=180)


def get_page(page_url, params):
    #Scrape Table Cells
    page = requests.get(page_url, params=params)
    # print(page.text)  # uncomment to inspect the raw response when debugging

    doc = lh.fromstring(page.content)
    tr_elements = doc.xpath('//tr')
    #print([len(T) for T in tr_elements[:12]])

    #Parse Table Header
    tr_elements = doc.xpath('//tr')
    col = []
    i = 0
    for t in tr_elements[0]:
        i += 1
        name = t.text_content()
        print('%d:"%s"' % (i, name))
        col.append((name, []))
    # print(col)

    ###Create Pandas Dataframe###
    for j in range(1, len(tr_elements)):
        T = tr_elements[j]
        if len(T) != 9:
            break
        i = 0
        for t in T.iterchildren():
            data = t.text_content().strip()
            if i > 0:
                try:
                    data = int(data)
                except:
                    pass
            col[i][1].append(data)
            i += 1
        # print(col[0:3])
    #print([len(C) for (title,C) in col])

    ###Format Dataframe###
    Dict = {title: column for (title, column) in col}
    df = pd.DataFrame(Dict)
    df = df.replace('\n', '', regex=True)
    df = df.replace('  ', ' ', regex=True)
    df['Receiving Committee'] = df['Receiving Committee'].apply(
        lambda x: x.strip().capitalize())

    ###Print Dataframe###
    with pd.option_context('display.max_rows', 10, 'display.max_columns',
                           10):  # more options can be specified also
        print(df)

    return df


def get_all_pages(base_url):
    df_list = []
    for i in range(1, 42):
        params = {"schedule": "1A", "changes": '0', "page": str(i)}
        df = get_page(base_url, params)
        print(df)
        if df.empty:
            print("Empty dataframe! All done.")
            break
        df_list.append(df)
        print(df)
        print('====================================')
    return df_list


df_list = get_all_pages(base_url)
pd.concat(df_list).to_csv('Whitmer_Donors.csv', mode='w', header=False)

#create excel writer
#writer = pd.ExcelWriter("Whitmer_Donors.xlsx")

#write dataframe to excel#
#df.to_excel(writer)
#writer.save()
print("Dataframe is written successfully to excel")

Here's a slightly different implementation. Use read_html() to get the table directly into pandas, and then use soup to find the next page. If there is no next page, the program will exit. The page you are scraping has 40 pages, so start at 38, for example, and it will quit and print a df with 300 rows. Any mods to the dataframe you can do at the end.

import json  # json is used below but was not imported in the snippets above

# this function looks for the next page url; returns None if it isn't there
def parse(soup):
    try:
        return json.loads(soup.find('search-results').get(':pagination'))['next_page_url']
    except:
        return None


start_urls = ['https://cfrsearch.nictusa.com/documents/473261/details/filing/contributions?schedule=1A&changes=0&page=38'] # change to 1 for the full run
df_hold_list = [] # collect your dataframes to concat later

for url in start_urls: # you can iterate through different urls or just the one
    page = requests.get(url)
    soup = BeautifulSoup(page.text, "html.parser")
    df = pd.read_html(url)[0]
    df_hold_list.append(df)
    next = parse(soup)

    while True:
        if next:
            print(next)
            page = requests.get(next)
            soup = BeautifulSoup(page.text, "html.parser")
            df = pd.read_html(next)[0]  # parse the page we just requested, not the original url
            df_hold_list.append(df)
            next = parse(soup)
        else:
            break

df_final = pd.concat(df_hold_list)
df_final.shape

(300, 9) # 300 rows, 9 columns
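Any cleanup from the original script can then be applied once to the combined frame. A small sketch, assuming the column names that read_html() returns match the headers your lxml code printed (e.g. 'Receiving Committee'):

df_final = df_final.reset_index(drop=True)
df_final = df_final.replace('\n', '', regex=True).replace('  ', ' ', regex=True)
df_final['Receiving Committee'] = df_final['Receiving Committee'].astype(str).str.strip().str.capitalize()
df_final.to_csv('Whitmer_Donors.csv', index=False)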
