My goal is to scrape data from the PGA website to extract all the golf course locations in the USA. I aim to scrape from the 907 pages the name, address, ownership, phone number, and website.
I have created the script below but when the CSV is created it produces errors. The CSV file created from the script has data repetitions of the first few pages and the pages of the website. It does not give the whole data of the 907 pages.
How can I fix my script so that it will scrape all 907 pages and produce a CSV with all the golf courses listed on the PGA website?
Below is my script:
import csv
import requests
from bs4 import BeautifulSoup
for i in range(907): # Number of pages plus one
url = "http://www.pga.com/golf-courses/search?page={}&searchbox=Course+Name&searchbox_zip=ZIP&distance=50&price_range=0&course_type=both&has_events=0".format(i)
r = requests.get(url)
soup = BeautifulSoup(r.content)
g_data2=soup.find_all("div",{"class":"views-field-nothing"})
courses_list=[]
for item in g_data2:
try:
name=item.contents[1].find_all("div",{"class":"views-field-title"})[0].text
except:
name=''
try:
address1=item.contents[1].find_all("div",{"class":"views-field-address"})[0].text
except:
address1=''
try:
address2=item.contents[1].find_all("div",{"class":"views-field-city-state-zip"})[0].text
except:
address2=''
try:
website=item.contents[1].find_all("div",{"class":"views-field-website"})[0].text
except:
website=''
try:
Phonenumber=item.contents[1].find_all("div",{"class":"views-field-work-phone"})[0].text
except:
Phonenumber=''
course=[name,address1,address2,website,Phonenumber]
courses_list.append(course)
with open ('PGA_Data.csv','a') as file:
writer=csv.writer(file)
for row in courses_list:
writer.writerow(row)
Her is the code that you want. It will first parse the current page before going on to the next one. (There are some blank rows, I hope you can fix that yourself).
import csv
import requests
from bs4 import BeautifulSoup
def encode(l):
out = []
for i in l:
text = str(i).encode('utf-8')
out.append(''.join([i if ord(i) < 128 else ' ' for i in text])) #taken from Martjin Pieter's answer
# http://stackoverflow.com/questions/20078816/replace-non-ascii-characters-with-a-single-space/20078869#20078869
return out
courses_list = []
for i in range(5): # Number of pages plus one
url = "http://www.pga.com/golf-courses/search?page={}&searchbox=Course+Name&searchbox_zip=ZIP&distance=50&price_range=0&course_type=both&has_events=0".format(i)
r = requests.get(url)
soup = BeautifulSoup(r.content)
g_data2=soup.find_all("div",{"class":"views-field-nothing"})
for item in g_data2:
try:
name = item.contents[1].find_all("div",{"class":"views-field-title"})[0].text
except:
name=''
try:
address1= item.contents[1].find_all("div",{"class":"views-field-address"})[0].text
except:
address1=''
try:
address2= item.contents[1].find_all("div",{"class":"views-field-city-state-zip"})[0].text
except:
address2=''
try:
website= item.contents[1].find_all("div",{"class":"views-field-website"})[0].text
except:
website=''
try:
Phonenumber= item.contents[1].find_all("div",{"class":"views-field-work-phone"})[0].text
except:
Phonenumber=''
course=[name,address1,address2,website,Phonenumber]
courses_list.append(encode(course))
with open ('PGA_Data.csv','a') as file:
writer=csv.writer(file)
for row in courses_list:
writer.writerow(row)
EDIT : After the inevitable problems of unicode encoding/decoding, I have modified the answer and it will (hopefully) work now. But http://nedbatchelder.com/text/unipain.html see this.
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.