
Loop Scraping Multiple Pages Using Python and BS4

I'm a student journalist and am new to Python. I've been trying to figure out how to scrape each individual entry on all the current pages of my university's daily crime log using a for loop. However, it is only scraping the first page. I've been looking at other people's code and questions and couldn't figure out what I was missing. Any help is appreciated, thanks.

    import urllib.request
    import requests
    import csv
    import bs4
    import numpy as np
    import pandas as pd
    from pandas import DataFrame

    for num in range(27): #Number of pagers plus
        url = ("http://police.psu.edu/daily-crime-log?field_reported_value[value]&page=0".format(num))
        r = requests.get(url)
    source = urllib.request.urlopen(url).read()
    bs_tree = bs4.BeautifulSoup(source, "lxml")
    incident_nums = bs_tree.findAll("div", class_="views-field views-field-title")
    occurred = bs_tree.findAll("div", class_="views-field views-field-field-occurred")
    reported = bs_tree.findAll("div", class_="views-field views-field-field-reported")
    incidents = bs_tree.findAll("div", class_="views-field views-field-field-nature-of-incident")
    offenses = bs_tree.findAll("div", class_="views-field views-field-field-offenses")
    locations = bs_tree.findAll("div", class_="views-field views-field-field-location")
    dispositions = bs_tree.findAll("div", class_="views-field views-field-field-case-disposition")
    allCrimes = pd.DataFrame(columns = ['Incident#', 'Occurred', 'reported', 'nature of incident', 'offenses', 'location', 'disposition'])
    total = len(incident_nums)
    count = 0
    while (count<total):
        incNum = incident_nums[count].find("span", class_="field-content").get_text()
        occr = occurred[count].find("span", class_="field-content").get_text()
        repo = reported[count].find("span", class_="field-content").get_text()
        incNat = incidents[count].find("span", class_="field-content").get_text()
        offe = offenses[count].find("span", class_="field-content").get_text()
        loca = locations[count].find("span", class_="field-content").get_text()
        disp = dispositions[count].find("span", class_="field-content").get_text()
        allCrimes.loc[count] =[incNum, occr, repo, incNat, offe, loca, disp]
        count +=1

Following others' examples isn't necessarily bad practice, but you need to check that each piece works as you add it, at least until you gain confidence.

For instance, if you try running this for-loop on its own ...

>>> for num in ('29'):
...     num
...     
'2'
'9'

you see that Python assigns '2' to num and then '9', because iterating over a string yields its characters one at a time. Not what you wanted.

If I follow your lead, having examined that site, I see that pages 0 through 26 exist, so I can write for num in range(27). The initial value defaults to zero, and the loop stops at one less than the value I gave. In the statement where you request the URL, you will need to convert this integer to a string and insert it into the URL (formatting).
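As a rough sketch of that formatting step: str.format only substitutes where the string contains a {} placeholder, and the URL in the question ends in a literal page=0, so .format(num) leaves it unchanged. The names BASE_URL and page_urls below are made up for illustration, and the page count of 27 is taken from the observation above rather than checked against the live site.

```python
# Hypothetical constant: the question's URL with a {} placeholder for the page number.
BASE_URL = "http://police.psu.edu/daily-crime-log?field_reported_value[value]&page={}"


def page_urls(page_count):
    """Return one URL per page index, 0 through page_count - 1."""
    return [BASE_URL.format(num) for num in range(page_count)]


urls = page_urls(27)
print(urls[0])   # first page, ends in page=0
print(urls[-1])  # last page, ends in page=26
```

Printing the first and last URL before making any requests is a cheap way to confirm the loop covers the pages you expect.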

You go through the loop multiple times without keeping anything! If you want other statements to be executed as you go round the loop then you need to indent them (or maybe this happened when you submitted your code).
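A minimal sketch of that idea, assuming the download and parse step is wrapped in a function: the container is created once before the loop, and the statements that use each page are indented inside it. The names scrape_all, fetch_page and fake_fetch_page are hypothetical; fake_fetch_page only stands in for the real requests and BeautifulSoup work so the sketch runs offline.

```python
def scrape_all(page_count, fetch_page):
    """Collect rows from every page into a single list.

    fetch_page is a hypothetical callable: given a page number, it returns
    a list of rows (one dict per incident) scraped from that page.
    """
    all_rows = []                   # created once, before the loop
    for num in range(page_count):
        rows = fetch_page(num)      # indented: runs once per page
        all_rows.extend(rows)       # indented: keeps each page's rows
    return all_rows                 # after the loop: everything collected


def fake_fetch_page(num):
    """Stand-in for the real download and parse step (an assumption for this demo)."""
    return [{"page": num, "incident": "demo-{}-{}".format(num, i)} for i in range(2)]


rows = scrape_all(3, fake_fetch_page)
print(len(rows))  # 6 (3 pages with 2 demo rows each)
```

In the real script, the function passed as fetch_page would hold the request and the findAll lookups from the question, and the collected list of dicts could be handed to pd.DataFrame once at the end instead of rebuilding the frame on every pass.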

After this I'm not clear what you're doing.
