How can I loop through multiple pages to scrape table data (Python)?

I'm struggling to find a way to loop through pages and scrape data from a table. I've managed to get the data from the first page, but I don't know how to go through each subsequent page and get its data. I've tried various bits of code but haven't been able to get anything to work. The site I'm trying to scrape appends &pageno=2 to the end of the URL and uses "next" buttons rather than numbered buttons. Any help would be great.

The current code, which scrapes the first page successfully, is as follows:

import pprint

import requests
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36'}

url = 'https://www.revcomps.com/past-entry-lists/?draw_chosen=2693823'

r = requests.get(url, headers=headers)

soup = BeautifulSoup(r.text, 'html.parser')

# Flatten each matching table into a list of its cell texts and print it.
for table in soup.find_all('table', {'class': 'ticket_results'}):
    data = [td.text for td in table.find_all('td')]
    pprint.pprint(data)

You can put the request inside a loop over the page number and use a Python f-string to insert the page variable into the URL:

import requests
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36'}

# range(1, 3) requests pages 1 and 2; widen the range to fetch more pages.
for page in range(1, 3):
    print(f"Page {page}")
    url = f'https://www.revcomps.com/past-entry-lists/?draw_chosen=2693823&pageno={page}'
    r = requests.get(url, headers=headers)
    soup = BeautifulSoup(r.text, 'html.parser')

    for table in soup.find_all('table', {'class':'ticket_results'}):
        data = [td.text for td in table.find_all('td')]
        print(data)

This gives output starting:

Page 1
['Order', 'WINNING TICKET', 'Ticket', '2694340', 'Andrew Reynolds', '694', '2699224', 'Martin Lilge', '315', '2703986', 'Ricky Parton', '975']
['Order', 'Customer', 'Ticket', '2704184', 'Philip Stoyles', '001', '2700874', 'Timothy Powell', '002', '2696801', 'Steven Hill', '003', '2696301', 'Trevor Larken', '004', '2696387', 'george malone', '005', '2701735', 'Williams jonathan', '006', '2704193', 'Michael Worthington', '007', '2695573', 'Mike Bates', '008', '2695170', 'Debbie Gent', '009', '2699892', 'Edward Buffett', '010', '2694080', 'David Miller', '011', '2701554', 'Liz Coates', '012', '2694944', 'Amanda Demellweek', '013', '2695128', 'John Crowe', '014', '2698092', 'Jamie Houston', '015', '2703986', 'Ricky Parton', '016', '2700944', 'Tom Chant', '017', '2698687', 'Gary Young', '018', '2696026', 'Tritean Emanuel', '019', '2704117', 'Stephen Melekeowei', '020', '2700379', 'Darren Pearson', '021', '2696357', 'Kane Nicholas', '022', '2704062', 'Jessie Nellany', '023', '2700621', 'Nick Hart', '024', '2704879', 'Chris Maynard', '025', '2703091', 'Nils Omell', '026', '2702854', 'mr stephen elsley', '027', '2698997', 'Mark Skedgel-hill', '028', '2701558', 'Bradley King', '029', '2698372', 'simon miles', '030', '2694701', 'Gillian Chisnall', '031', '2701365', 'Sarah Tingle-kitchen', '032', '2694591', 'Robert Townsend', '033', '2695077', 'Glen Davies', '034', '2695177', 'Wayne Cummings', '035', '2701899', 'Ross Hay', '036', '2703464', 'Shaun Raynsford', '037', '2704149', 'K Oszlanczi', '038', '2703566', 'Daniel Fuller', '039', '2699263', 'Adam Torok', '040', '2700621', 'Nick Hart', '041', '2703279', 'Omar Shawesh', '042', '2699452', 'Mark Widger', '043', '2695848', 'Zoey Longley', '044', '2703704', 'Daniel Hyndman', '045', '2696997', 'Paul Daniel', '046', '2694506', 'Mark Thompson', '047', '2699460', 'Martin Buckingham', '048', '2695186', 'Matt Beavis', '049', '2701503', 'Craig Driscoll', '050', '2699318', 'Alan Parker', '051', '2699729', 'Stephen Minnikin', '052', '2695573', 'Mike Bates', '053', '2698438', 'Andrew Kosinski', '054', '2698679', 'Carly Mason', '055', '2702121', 'Mark Adams', '056', '2698613', 
'Neil Gunn', '057', '2704149', 'K Oszlanczi', '058', '2699109', 'Steve Bowen', '059', '2702108', 'Thomas Martin', '060', '2696482', 'mr stephen elsley', '061', '2696813', 'Nigel Scott', '062', '2701394', 'Chris Brown', '063', '2698459', 'Gordon Bickerton', '064', '2700546', 'jon tribbeck', '065', '2702492', 'Mark Bentley', '066', '2704155', 'Ryan Stephens', '067', '2694831', 'David Godfrey', '068', '2695671', 'Lee Smith', '069', '2695066', 'Kristian Howells', '070', '2694225', 'Simon Costello', '071', '2695186', 'Matt Beavis', '072', '2699947', 'Anthony Abbey', '073', '2701845', 'Paul Quarterman', '074', '2695573', 'Mike Bates', '075', '2701618', 'Adam Kimber', '076', '2704433', 'John Hayes', '077', '2699484', 'Jamie Brookes', '078', '2695587', 'Richard Hurst', '079', '2696301', 'Trevor Larken', '080', '2698200', 'Ewa Krzyszkowska', '081', '2698023', 'Jason Reed', '082', '2702455', 'Simon Harrington', '083', '2694869', 'Mike Bates', '084', '2703644', 'Jason Gow', '085', '2700989', 'A J Freeman', '086', '2696784', 'Adam Timberlake', '087', '2701447', 'Lewis Middleton', '088', '2701236', 'Scot Beall', '089', '2695477', 'Simon Farrow', '090', '2697197', 'Marcus du Preez', '091', '2697115', 'Roderick Evans', '092', '2700621', 'Nick Hart', '093', '2701231', 'Norma Matheson', '094', '2695587', 'Richard Hurst', '095', '2702017', 'Michael Richardson', '096', '2703702', 'Dean Brain', '097', '2699907', 'Lee Murray', '098', '2694583', 'Kasia Krzyzak', '099', '2700048', 'terri palmer', '100', '2699499', 'Simon Hack', '101', '2694206', 'Graeme Allister', '102', '2700158', 'Melissa Dedman', '103', '2699262', 'Romans Zolovs', '104', '2694125', 'Jonathan Byrne', '105', '2702812', 'Nicola McLaughlin', '106', '2704152', 'Howard Pearson', '107', '2696432', 'Zac Sirrell', '108', '2696474', "Luke Davies-O'Grady", '109', '2699367', 'Charles Mulinder', '110', '2701365', 'Sarah Tingle-kitchen', '111', '2703659', 'Andrew Fenton', '112', '2695167', 'Roy heer', '113', '2698200', 'Ewa 
Krzyszkowska', '114', '2697494', 'Steve Nightingale', '115', '2698916', 'Dale Hodges', '116', '2695502', 'G G hearn', '117', '2699776', 'Antiny Swift', '118', '2704778', 'MARK SHAKESBY', '119', '2698200', 'Ewa Krzyszkowska', '120', '2694027', 'Paul Otway', '121', '2700621', 'Nick Hart', '122', '2695847', 'Gavin Holmes', '123', '2699915', 'Torquil Stupart', '124', '2703807', 'Andrew Telfer', '125', '2699931', 'Lloyd Reed', '126', '2700991', 'Clare Brown', '127', '2699914', 'Luke Twivey', '128', '2699308', 'MIKE SPENCER', '129', '2698885', 'dave wills', '130', '2695933', 'Regan Thacker', '131', '2696301', 'Trevor Larken', '132', '2698960', 'Adam Hamada', '133', '2699566', 'Action Fighter', '134', '2703704', 'Daniel Hyndman', '135', '2702652', 'Sarah Brooke', '136', '2694305', 'Scott Knowles', '137', '2700635', 'Jasen Swann', '138', '2696301', 'Trevor Larken', '139', '2694831', 'David Godfrey', '140', '2694174', 'Silviu Dan', '141', '2704446', 'Alan Ball', '142', '2699026', 'Adam Gillett', '143', '2699916', 'Dillon Graham', '144', '2698613', 'Neil Gunn', '145', '2697494', 'Steve Nightingale', '146', '2696380', 'Danny James Pearson', '147', '2700010', 'Peter Ede-Morley', '148', '2704731', 'Simon Wise', '149', '2694056', 'Joel Binns', '150']
Page 2
['Order', 'WINNING TICKET', 'Ticket', '2694340', 'Andrew Reynolds', '694', '2699224', 'Martin Lilge', '315', '2703986', 'Ricky Parton', '975']
['Order', 'Customer', 'Ticket', '2694305', 'Scott Knowles', '151', '2694171', 'Mariusz Karczewski', '152', '2704983', 'Jonathan Hill', '153', '2696473', 'claudia stefanoaia', '154', '2694111', 'David Robinson', '155', '2696301', 'Trevor Larken', '156', '2696270', 'Stuart Bowater', '157', '2699819', 'Ben Funnell', '158', '2703237', 'Mark Lund', '159', '2702804', 'Iain Wallace', '160', '2694206', 'Graeme Allister', '161', '2703060', 'Mark Maskell', '162', '2699308', 'MIKE SPENCER', '163', '2700589', 'Aidan McGilligan', '164', '2698428', 'Benjamin Melsome', '165', '2701686', 'Mariusz Karczewski', '166', '2694121', 'Joseph Woodard', '167', '2700989', 'A J Freeman', '168', '2699109', 'Steve Bowen', '169', '2704382', 'Keith Groundwater', '170', '2700144', 'Carl Marshall', '171', '2698017', 'Geoff Hall', '172', '2704941', 'Graham Riley', '173', '2697494', 'Steve Nightingale', '174', '2697796', 'Gary Leech', '175', '2699229', 'Karl Anson', '176', '2702100', 'Gary Plaskett', '177', '2694826', 'Rayminther Singh', '178', '2702394', 'Rebecca Smith', '179', '2694149', 'Martin Yates', '180', '2700860', 'Katie West', '181', '2695412', 'Daniel Payne', '182', '2695412', 'Daniel Payne', '183', '2699052', 'Ryan Stephens', '184', '2699136', 'Kevin Oliver', '185', '2696124', 'Lee Beesley', '186', '2695997', 'Matthew Prowse', '187', '2704493', 'Mrs P E Cranwell-Hayes', '188', '2701735', 'Williams jonathan', '189', '2699013', 'Charley Isaacs', '190', '2696452', 'Caroline Calver', '191', '2703014', 'Ryan Stephens', '192', '2699776', 'Antiny Swift', '193', '2694206', 'Graeme Allister', '194', '2702649', 'Jason Mcknight', '195', '2701415', 'Daniella Murphy', '196', '2694225', 'Simon Costello', '197', '2702685', 'Chris Firth', '198', '2701445', 'Ashlyn Adams', '199', '2694305', 'Scott Knowles', '200', '2694305', 'Scott Knowles', '201', '2695587', 'Richard Hurst', '202', '2694992', 'Dave Tomley', '203', '2694296', 'Rob Thornton', '204', '2699275', 'barry venn', '205', '2701234', 'Ben 
Cassidy', '206', '2699460', 'Martin Buckingham', '207', '2697494', 'Steve Nightingale', '208', '2694206', 'Graeme Allister', '209', '2697361', 'Nathan Bambury', '210', '2703464', 'Shaun Raynsford', '211', '2694471', 'lewis ballantyne', '212', '2694831', 'David Godfrey', '213', '2699627', 'Ross Fulton', '214', '2700449', 'Josh Hill', '215', '2695609', 'Will Badman', '216', '2698885', 'dave wills', '217', '2700989', 'A J Freeman', '218', '2694953', 'Mark Thomas', '219', '2700184', 'steven bennetts', '220', '2699109', 'Steve Bowen', '221', '2694305', 'Scott Knowles', '222', '2701572', 'Ethne Gambrill-Jarman', '223', '2694944', 'Amanda Demellweek', '224', '2698549', 'Mr R Bennett', '225', '2704463', 'Chris Beckett', '226', '2694608', 'Ryan Stephens', '227', '2700637', 'Andrew Mckimm', '228', '2694346', 'Will Stanyard', '229', '2699109', 'Steve Bowen', '230', '2701735', 'Williams jonathan', '231', '2701554', 'Liz Coates', '232', '2694818', 'Matt Dawe', '233', '2694372', 'Richard Lindsay', '234', '2699148', 'Grant Sivewright', '235', '2704556', 'Dale Warren', '236', '2694080', 'David Miller', '237', '2701266', 'Russell Miller', '238', '2694171', 'Mariusz Karczewski', '239', '2701647', 'Peter Renshaw', '240', '2699252', 'Nicola Haigh', '241', '2695609', 'Will Badman', '242', '2702654', 'I Petkuns', '243', '2698634', 'Gay Pieters', '244', '2701286', 'timothy cozens', '245', '2697830', 'Kevin Teasdale', '246', '2695046', 'Dan Christian Buentipo Palos', '247', '2694304', 'Gary Faulkner', '248', '2702737', 'Michael Welch', '249', '2704123', 'Paul Mcdermott', '250', '2696161', 'Jono Carter', '251', '2695871', 'Cameron Davidson', '252', '2704384', 'Lauren Redhead', '253', '2694414', 'Elaine Hills', '254', '2700798', 'Mathew Pierce', '255', '2704839', 'Danny Irvine', '256', '2704790', 'Gary Perry', '257', '2694056', 'Joel Binns', '258', '2694346', 'Will Stanyard', '259', '2700243', 'Scott Gourlay', '260', '2694206', 'Graeme Allister', '261', '2699263', 'Adam Torok', '262', 
'2695077', 'Glen Davies', '263', '2699109', 'Steve Bowen', '264', '2695149', 'Martin Wheeler', '265', '2697877', 'Rob Poundall', '266', '2697906', 'Mike Finn', '267', '2698068', 'Miguel Pacheco', '268', '2701176', 'alex mitchell', '269', '2700998', 'Antonio Domingo', '270', '2697049', 'James Skinner', '271', '2701415', 'Daniella Murphy', '272', '2698886', 'Julie Neill', '273', '2696260', 'John Doody', '274', '2696301', 'Trevor Larken', '275', '2694831', 'David Godfrey', '276', '2703702', 'Dean Brain', '277', '2702017', 'Michael Richardson', '278', '2697361', 'Nathan Bambury', '279', '2699938', 'Charlotte Jukes', '280', '2695350', 'Paul Fieldhouse', '281', '2702350', 'Barry Little', '282', '2694849', 'Matthew Riddell', '283', '2695592', 'Robert Harvey', '284', '2703363', 'Mason BURKINSHAW', '285', '2698579', 'Louise Davies', '286', '2696694', 'Stewart Smith', '287', '2704522', 'Adam Gillett', '288', '2701236', 'Scot Beall', '289', '2696784', 'Adam Timberlake', '290', '2704628', 'Lee Heginbotham', '291', '2699389', 'Lucy Donovan', '292', '2702673', 'James Jackson', '293', '2700232', 'raysean wharton', '294', '2699109', 'Steve Bowen', '295', '2699451', 'Winai Mays', '296', '2702364', 'Graham Williams', '297', '2695368', 'Daniel Moore', '298', '2703678', 'Ian Smith', '299', '2694027', 'Paul Otway', '300']
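
Each list printed above is flat: the three header cells come first, followed by the order number, customer and ticket of every entry in turn. If you would rather work with one row per entry, a minimal sketch is to slice the list into groups of three. The helper name cells_to_rows and its width parameter are made up for this illustration, and it assumes every row really does have exactly three cells, as the output above suggests.

```python
def cells_to_rows(cells, width=3):
    """Split a flat list of cell texts into rows of `width` cells each."""
    return [cells[i:i + width] for i in range(0, len(cells), width)]


# A short excerpt of the second list from page 1 above.
sample = ['Order', 'Customer', 'Ticket',
          '2704184', 'Philip Stoyles', '001',
          '2700874', 'Timothy Powell', '002']

# The first group is the header row; the rest are the entries.
header, *rows = cells_to_rows(sample)
for order, customer, ticket in rows:
    print(order, customer, ticket)
```

Rows in this shape can be passed straight to csv.writer(...).writerows(rows) if you want to save them.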

You should look into what happens when the end of the table is reached and test for that.
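
One possible way to test for the end is sketched below. I have not checked how the site actually behaves past the last page, so the stop conditions are assumptions: the loop stops when a page has no ticket_results table, when the entries table holds only its three header cells, or when the site serves the same entries as the previous page. It also assumes the entry list is the last ticket_results table on the page, as in the output above. The names fetch_tables, scrape_all_pages, header_cells and max_pages are invented for this example.

```python
def fetch_tables(page):
    """Return one list of cell texts per 'ticket_results' table on the page."""
    # Imported here so the pagination logic below can be tried without them.
    import requests
    from bs4 import BeautifulSoup

    headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36'}
    url = f'https://www.revcomps.com/past-entry-lists/?draw_chosen=2693823&pageno={page}'
    r = requests.get(url, headers=headers, timeout=30)
    r.raise_for_status()
    soup = BeautifulSoup(r.text, 'html.parser')
    return [[td.text for td in table.find_all('td')]
            for table in soup.find_all('table', {'class': 'ticket_results'})]


def scrape_all_pages(fetch=fetch_tables, header_cells=3, max_pages=1000):
    """Collect the entries table of each page until a page adds nothing new."""
    pages = []
    previous = None
    for page in range(1, max_pages + 1):
        tables = fetch(page)
        # Assumption: the entry list is the last table on the page.
        entries = tables[-1] if tables else []
        if len(entries) <= header_cells:
            break  # no table at all, or only the header row
        if entries == previous:
            break  # the site served the same entries again
        pages.append(entries)
        previous = entries
    return pages
```

Calling scrape_all_pages() with no arguments fetches from the site; passing your own fetch function lets you try the stopping logic on canned data first. The max_pages cap is only a safety net so the loop cannot run forever if neither stop condition ever triggers.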

The technical posts on this site are licensed under CC BY-SA 4.0. If you reprint them, please cite the site URL or the original address.

 