
How can I scrape data from multiple pages of a website in a loop using Python and BeautifulSoup4?

I am trying to scrape data from the PGA.com website to get a table of all of the golf courses in the United States. In my CSV table I want to include the name of the golf course, address, ownership, website, and phone number. With this data I would like to geocode it, place it on a map, and have a local copy on my computer.

I used Python and Beautiful Soup 4 to extract my data. I have gotten as far as extracting the data and importing it into a CSV, but I am now having a problem scraping data from multiple pages on the PGA website. I want to extract ALL THE GOLF COURSES, but my script is limited to one page; I want to loop it in a way that captures all data for golf courses from all pages found on the PGA site. There are about 18,000 golf courses and 900 pages to capture data from.

Attached below is my script. I need help creating code that will capture ALL data from the PGA website, not just a single page but every page, so that it provides me with all the data on golf courses in the United States.

Here is my script below:

import csv
import requests 
from bs4 import BeautifulSoup
url = "http://www.pga.com/golf-courses/search?searchbox=Course+Name&searchbox_zip=ZIP&distance=50&price_range=0&course_type=both&has_events=0"

r = requests.get(url)

soup = BeautifulSoup(r.content)

g_data1=soup.find_all("div",{"class":"views-field-nothing-1"})
g_data2=soup.find_all("div",{"class":"views-field-nothing"})

courses_list=[]

for item in g_data2:
     try:
          name=item.contents[1].find_all("div",{"class":"views-field-title"})[0].text
     except:
          name=''
     try:
          address1=item.contents[1].find_all("div",{"class":"views-field-address"})[0].text
     except:
          address1=''
     try:
          address2=item.contents[1].find_all("div",{"class":"views-field-city-state-zip"})[0].text
     except:
          address2=''
     try:
          website=item.contents[1].find_all("div",{"class":"views-field-website"})[0].text
     except:
          website=''   
     try:
          Phonenumber=item.contents[1].find_all("div",{"class":"views-field-work-phone"})[0].text
     except:
          Phonenumber=''      

     course=[name,address1,address2,website,Phonenumber]
     courses_list.append(course)

     with open ('filename5.csv','wb') as file:
          writer=csv.writer(file)
          for row in courses_list:
               writer.writerow(row)    

#for item in g_data1:
     #try:
          #print item.contents[1].find_all("div",{"class":"views-field-counter"})[0].text
     #except:
          #pass  
     #try:
          #print item.contents[1].find_all("div",{"class":"views-field-course-type"})[0].text
     #except:
          #pass

#for item in g_data2:
   #try:
      #print item.contents[1].find_all("div",{"class":"views-field-title"})[0].text
   #except:
      #pass
   #try:
      #print item.contents[1].find_all("div",{"class":"views-field-address"})[0].text
   #except:
      #pass
   #try:
      #print item.contents[1].find_all("div",{"class":"views-field-city-state-zip"})[0].text
   #except:
      #pass

This script only captures 20 courses at a time, and I want to capture everything in one script, which means 18,000 golf courses and 900 pages to scrape from.

The PGA website's search has multiple pages; the URL follows the pattern:

http://www.pga.com/golf-courses/search?page=1 # Additional info after page parameter here

This means you can read the content of the page, then increase the value of page by 1 and read the next page, and so on.

import csv
import requests 
from bs4 import BeautifulSoup
for i in range(907):      # page=0 through page=906
    url = "http://www.pga.com/golf-courses/search?page={}&searchbox=Course+Name&searchbox_zip=ZIP&distance=50&price_range=0&course_type=both&has_events=0".format(i)
    r = requests.get(url)
    soup = BeautifulSoup(r.content, "html.parser")

    # Your code for each individual page here 
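
For reference, here is one way the per-page extraction from the question could be folded into that loop. This is only a minimal sketch that assumes the views-field-* class names from the original script are still valid; it also writes the CSV once at the end rather than on every iteration:

import csv
import requests
from bs4 import BeautifulSoup

def field(item, css_class):
    # Return the stripped text of the first matching div, or '' if it is missing.
    tag = item.find("div", {"class": css_class})
    return tag.get_text(strip=True) if tag else ""

courses_list = []
for i in range(907):    # page=0 through page=906
    url = "http://www.pga.com/golf-courses/search?page={}&searchbox=Course+Name&searchbox_zip=ZIP&distance=50&price_range=0&course_type=both&has_events=0".format(i)
    r = requests.get(url)
    soup = BeautifulSoup(r.content, "html.parser")

    for item in soup.find_all("div", {"class": "views-field-nothing"}):
        courses_list.append([
            field(item, "views-field-title"),
            field(item, "views-field-address"),
            field(item, "views-field-city-state-zip"),
            field(item, "views-field-website"),
            field(item, "views-field-work-phone"),
        ])

# Write everything once, after all pages have been scraped.
with open("courses.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["Name", "Address", "CityStateZip", "Website", "Phone"])
    writer.writerows(courses_list)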

If you are still reading this post, you can try this code too:

from urllib.request import urlopen
from bs4 import BeautifulSoup

file = "Details.csv"
f = open(file, "w")
Headers = "Name,Address,City,Phone,Website\n"
f.write(Headers)
for page in range(1,5):
    url = "http://www.pga.com/golf-courses/search?page={}&searchbox=Course%20Name&searchbox_zip=ZIP&distance=50&price_range=0&course_type=both&has_events=0".format(page)
    html = urlopen(url)
    soup = BeautifulSoup(html,"html.parser")
    Title = soup.find_all("div", {"class":"views-field-nothing"})
    for i in Title:
        try:
            name = i.find("div", {"class":"views-field-title"}).get_text()
            address = i.find("div", {"class":"views-field-address"}).get_text()
            city = i.find("div", {"class":"views-field-city-state-zip"}).get_text()
            phone = i.find("div", {"class":"views-field-work-phone"}).get_text()
            website = i.find("div", {"class":"views-field-website"}).get_text()
            print(name, address, city, phone, website)
            f.write("{}".format(name).replace(",","|")+ ",{}".format(address)+ ",{}".format(city).replace(",", " ")+ ",{}".format(phone) + ",{}".format(website) + "\n")
        except AttributeError:
            pass    # skip listings that are missing one of the fields
f.close()

Where it says range(1,5), just change that to run through the last page, and you will get all the details in the CSV. I tried very hard to get your data in the proper format, but it's hard :).
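
If getting the commas right by hand is the painful part, Python's csv module handles quoting automatically. Here is a minimal sketch of the same loop using csv.writer, assuming the same class names as the snippet above:

import csv
from urllib.request import urlopen
from bs4 import BeautifulSoup

with open("Details.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["Name", "Address", "City", "Phone", "Website"])
    for page in range(1, 5):
        url = "http://www.pga.com/golf-courses/search?page={}&searchbox=Course%20Name&searchbox_zip=ZIP&distance=50&price_range=0&course_type=both&has_events=0".format(page)
        soup = BeautifulSoup(urlopen(url), "html.parser")
        for i in soup.find_all("div", {"class": "views-field-nothing"}):
            try:
                row = [
                    i.find("div", {"class": "views-field-title"}).get_text(strip=True),
                    i.find("div", {"class": "views-field-address"}).get_text(strip=True),
                    i.find("div", {"class": "views-field-city-state-zip"}).get_text(strip=True),
                    i.find("div", {"class": "views-field-work-phone"}).get_text(strip=True),
                    i.find("div", {"class": "views-field-website"}).get_text(strip=True),
                ]
            except AttributeError:
                continue    # skip listings that are missing one of the fields
            writer.writerow(row)    # the csv module takes care of commas and quoting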

You're pointing at a single page; it's not going to iterate through each one on its own.

Page 1:

url = "http://www.pga.com/golf-courses/search?searchbox=Course+Name&searchbox_zip=ZIP&distance=50&price_range=0&course_type=both&has_events=0"

Page 2:

http://www.pga.com/golf-courses/search?page=1&searchbox=Course%20Name&searchbox_zip=ZIP&distance=50&price_range=0&course_type=both&has_events=0

Page 907: http://www.pga.com/golf-courses/search?page=906&searchbox=Course%20Name&searchbox_zip=ZIP&distance=50&price_range=0&course_type=both&has_events=0

Since you're only requesting page 1, you'll only get 20 results. You'll need to create a loop that runs through each page.

You can start off by creating a function that handles one page, then iterate that function over every page (see the sketch below).

Right after the search? in the URL, starting at page 2, page=1 appears and keeps increasing until page 907, where it is page=906.
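
Here is a minimal sketch of that function-then-iterate structure; the parsing body is only a placeholder, since it depends on which fields you want:

import requests
from bs4 import BeautifulSoup

BASE_URL = "http://www.pga.com/golf-courses/search?page={}&searchbox=Course+Name&searchbox_zip=ZIP&distance=50&price_range=0&course_type=both&has_events=0"

def scrape_page(page_number):
    # Fetch one results page and return a list of rows, one per course.
    r = requests.get(BASE_URL.format(page_number))
    soup = BeautifulSoup(r.content, "html.parser")
    rows = []
    for item in soup.find_all("div", {"class": "views-field-nothing"}):
        title = item.find("div", {"class": "views-field-title"})
        rows.append([title.get_text(strip=True) if title else ""])
        # ... extract the remaining fields the same way
    return rows

all_rows = []
for page in range(907):    # page=0 through page=906
    all_rows.extend(scrape_page(page))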

I noticed that the first solution repeated the first batch of results; that is because page 0 and page 1 are the same page. This is resolved by specifying the start page in the range function. Example below:

for i in range(1, 907):     # Start at page 1 so page 0 (a duplicate of page 1) is skipped
    url = "http://www.pga.com/golf-courses/search?page={}&searchbox=Course+Name&searchbox_zip=ZIP&distance=50&price_range=0&course_type=both&has_events=0".format(i)
    r = requests.get(url)
    soup = BeautifulSoup(r.content, "html5lib")   # Can use whichever parser you prefer

    # Your code for each individual page here

I had this exact same problem and the solutions above did not work. I solved mine by accounting for cookies. A requests session helps: create a session and it will pull all the pages you need, because the cookie it picks up is sent with every numbered page request.

import csv
import requests 
from bs4 import BeautifulSoup
url = "http://www.pga.com/golf-courses/search?searchbox=Course+Name&searchbox_zip=ZIP&distance=50&price_range=0&course_type=both&has_events=0"

s = requests.Session()
r = s.get(url)
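
From there, the same session can be reused for the numbered pages, so the cookies set by the first request are sent with every subsequent one. A minimal sketch continuing from the snippet above:

for i in range(907):
    page_url = "http://www.pga.com/golf-courses/search?page={}&searchbox=Course+Name&searchbox_zip=ZIP&distance=50&price_range=0&course_type=both&has_events=0".format(i)
    r = s.get(page_url)    # reuses the session created above, cookies included
    soup = BeautifulSoup(r.content, "html.parser")
    # ... parse each page as in the earlier answers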

The PGA website has changed since this question was asked.

It seems they organize all courses by: State > City > Course

In light of this change and the popularity of this question, here's how I'd solve this problem today.

Step 1 - Import everything we'll need:

import time
import random
from gazpacho import Soup   # https://github.com/maxhumber/gazpacho
from tqdm import tqdm       # to keep track of progress

Step 2 - Scrape all the state URL endpoints:

URL = "https://www.pga.com"

def get_state_urls():
    soup = Soup.get(URL + "/play")
    a_tags = soup.find("ul", {"data-cy": "states"}, mode="first").find("a")
    state_urls = [URL + a.attrs['href'] for a in a_tags]
    return state_urls

state_urls = get_state_urls()

Step 3 - Write a function to scrape all the city links:

def get_state_cities(state_url):
    soup = Soup.get(state_url)
    a_tags = soup.find("ul", {"data-cy": "city-list"}).find("a")
    state_cities = [URL + a.attrs['href'] for a in a_tags]
    return state_cities

state_url = state_urls[0]
city_links = get_state_cities(state_url)

Step 4 - Write a function to scrape all of the courses:

def get_courses(city_link):
    soup = Soup.get(city_link)
    courses = soup.find("div", {"class": "MuiGrid-root MuiGrid-item MuiGrid-grid-xs-12 MuiGrid-grid-md-6"}, mode="all")
    return courses

city_link = city_links[0]
courses = get_courses(city_link)

Step 5 - Write a function to parse all the useful info about a course:


def parse_course(course):
    return {
        "name": course.find("h5", mode="first").text,
        "address": course.find("div", {'class': "jss332"}, mode="first").strip(),
        "url": course.find("a", mode="first").attrs["href"]
    }

course = courses[0]
parse_course(course)

Step 6 - Loop through everything and save:

all_courses = []
for state_url in tqdm(state_urls):
    city_links = get_state_cities(state_url)
    time.sleep(random.uniform(1, 10) / 10)
    for city_link in city_links:
        courses = get_courses(city_link)
        time.sleep(random.uniform(1, 10) / 10)
        for course in courses:
            info = parse_course(course)
            all_courses.append(info)
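
Since the original goal was a local CSV, here is a short sketch for saving the collected list, assuming the dictionary keys produced by parse_course above:

import csv

with open("pga_courses.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["name", "address", "url"])
    writer.writeheader()
    writer.writerows(all_courses)    # one row per course dictionary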
