
How can I loop scraping data for multiple pages in a website using python and beautifulsoup4

I am trying to scrape data from the PGA.com website to get a table of all of the golf courses in the United States. In my CSV table I want to include the name of the golf course, address, ownership, website, and phone number. With this data I would like to geocode it, place it on a map, and have a local copy on my computer.

I used Python and Beautiful Soup 4 to extract my data. I have gotten as far as extracting the data and importing it into a CSV, but I am now having a problem scraping data from multiple pages on the PGA website. I want to extract ALL THE GOLF COURSES, but my script is limited to one page; I want to loop it in a way that captures all the golf course data from every page found on the PGA site. There are about 18,000 golf courses and 900 pages of data to capture.

Attached below is my script. I need help creating code that will capture ALL the data from the PGA website, not just one page but all of them. That way it will give me all the data on golf courses in the United States.

Here is my script:

import csv
import requests 
from bs4 import BeautifulSoup
url = "http://www.pga.com/golf-courses/search?searchbox=Course+Name&searchbox_zip=ZIP&distance=50&price_range=0&course_type=both&has_events=0"

r = requests.get(url)

soup = BeautifulSoup(r.content, "html.parser")

g_data1=soup.find_all("div",{"class":"views-field-nothing-1"})
g_data2=soup.find_all("div",{"class":"views-field-nothing"})

courses_list=[]

for item in g_data2:
     try:
          name=item.contents[1].find_all("div",{"class":"views-field-title"})[0].text
     except:
          name=''
     try:
          address1=item.contents[1].find_all("div",{"class":"views-field-address"})[0].text
     except:
          address1=''
     try:
          address2=item.contents[1].find_all("div",{"class":"views-field-city-state-zip"})[0].text
     except:
          address2=''
     try:
          website=item.contents[1].find_all("div",{"class":"views-field-website"})[0].text
     except:
          website=''   
     try:
          Phonenumber=item.contents[1].find_all("div",{"class":"views-field-work-phone"})[0].text
     except:
          Phonenumber=''      

     course=[name,address1,address2,website,Phonenumber]
     courses_list.append(course)

with open('filename5.csv', 'w', newline='') as file:
     writer=csv.writer(file)
     for row in courses_list:
          writer.writerow(row)

#for item in g_data1:
     #try:
          #print item.contents[1].find_all("div",{"class":"views-field-counter"})[0].text
     #except:
          #pass  
     #try:
          #print item.contents[1].find_all("div",{"class":"views-field-course-type"})[0].text
     #except:
          #pass

#for item in g_data2:
   #try:
      #print item.contents[1].find_all("div",{"class":"views-field-title"})[0].text
   #except:
      #pass
   #try:
      #print item.contents[1].find_all("div",{"class":"views-field-address"})[0].text
   #except:
      #pass
   #try:
      #print item.contents[1].find_all("div",{"class":"views-field-city-state-zip"})[0].text
   #except:
      #pass

This script only captures 20 courses at a time, and I want to capture everything in one script, which accounts for about 18,000 golf courses spread across roughly 900 pages to scrape.

The PGA website's search has multiple pages; the URL follows this pattern:

http://www.pga.com/golf-courses/search?page=1 # Additional info after page parameter here

This means you can read the content of a page, then increase the value of page by 1 and read the next page, and so on.

import csv
import requests 
from bs4 import BeautifulSoup
for i in range(907):      # Number of pages plus one 
    url = "http://www.pga.com/golf-courses/search?page={}&searchbox=Course+Name&searchbox_zip=ZIP&distance=50&price_range=0&course_type=both&has_events=0".format(i)
    r = requests.get(url)
    soup = BeautifulSoup(r.content, "html.parser")

    # Your code for each individual page here 
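
As a rough sketch of how that loop and the question's per-page extraction could fit together (the get_field helper and the courses.csv filename are my own placeholders, and the selectors are simply the ones from the question's script):

import csv
import requests
from bs4 import BeautifulSoup

def get_field(item, css_class):
    # Return the text of the first matching div, or '' if it is missing
    tag = item.find("div", {"class": css_class})
    return tag.get_text(strip=True) if tag else ''

courses_list = []
for i in range(907):      # pages 0..906
    url = "http://www.pga.com/golf-courses/search?page={}&searchbox=Course+Name&searchbox_zip=ZIP&distance=50&price_range=0&course_type=both&has_events=0".format(i)
    r = requests.get(url)
    soup = BeautifulSoup(r.content, "html.parser")

    # Same selectors as in the question, applied once per page
    for item in soup.find_all("div", {"class": "views-field-nothing"}):
        courses_list.append([
            get_field(item, "views-field-title"),
            get_field(item, "views-field-address"),
            get_field(item, "views-field-city-state-zip"),
            get_field(item, "views-field-website"),
            get_field(item, "views-field-work-phone"),
        ])

# Write the CSV once, after all pages have been scraped
with open("courses.csv", "w", newline="") as f:
    csv.writer(f).writerows(courses_list)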

If you are still reading this post, you can try this code too...

from urllib.request import urlopen
from bs4 import BeautifulSoup

file = "Details.csv"
f = open(file, "w")
Headers = "Name,Address,City,Phone,Website\n"
f.write(Headers)
for page in range(1,5):
    url = "http://www.pga.com/golf-courses/search?page={}&searchbox=Course%20Name&searchbox_zip=ZIP&distance=50&price_range=0&course_type=both&has_events=0".format(page)
    html = urlopen(url)
    soup = BeautifulSoup(html,"html.parser")
    Title = soup.find_all("div", {"class":"views-field-nothing"})
    for i in Title:
        try:
            name = i.find("div", {"class":"views-field-title"}).get_text()
            address = i.find("div", {"class":"views-field-address"}).get_text()
            city = i.find("div", {"class":"views-field-city-state-zip"}).get_text()
            phone = i.find("div", {"class":"views-field-work-phone"}).get_text()
            website = i.find("div", {"class":"views-field-website"}).get_text()
            print(name, address, city, phone, website)
            f.write("{}".format(name).replace(",","|")+ ",{}".format(address)+ ",{}".format(city).replace(",", " ")+ ",{}".format(phone) + ",{}".format(website) + "\n")
        except AttributeError:
            pass
f.close()

Where it says range(1,5), just change it so it runs from the first page up to the last page, and you will get all the details in the CSV. I tried very hard to get your data in a proper format, but it's hard :).
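
If the manual comma handling gets fiddly, a small sketch of the same write using the csv module might help, since csv.writer quotes fields containing commas automatically; the example values below are made up and just stand in for the find(...).get_text() results above:

import csv

# Example row values; in the real script these come from the BeautifulSoup lookups above
name, address, city, phone, website = "The Links, East", "1 Fairway Dr", "Austin TX 78701", "555-0100", "example.com"

with open("Details.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["Name", "Address", "City", "Phone", "Website"])
    writer.writerow([name, address, city, phone, website])   # commas inside a field are quoted, not split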

You're pointing at a link to a single page; it's not going to iterate through each one on its own.

Page 1:

url = "http://www.pga.com/golf-courses/search?searchbox=Course+Name&searchbox_zip=ZIP&distance=50&price_range=0&course_type=both&has_events=0"

Page 2:

http://www.pga.com/golf-courses/search?page=1&searchbox=Course%20Name&searchbox_zip=ZIP&distance=50&price_range=0&course_type=both&has_events=0

Page 907: http://www.pga.com/golf-courses/search?page=906&searchbox=Course%20Name&searchbox_zip=ZIP&distance=50&price_range=0&course_type=both&has_events=0

Since you're running it for page 1, you'll only get 20 results. You'll need to create a loop that runs through each page.

You can start off by creating a function that does one page, then iterate that function, as in the sketch below.
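
A rough sketch of that function-then-loop structure (scrape_page is a hypothetical helper name, and the per-page parsing is left as a placeholder for whatever extraction code you already have):

import requests
from bs4 import BeautifulSoup

def scrape_page(page_number):
    # Fetch one search results page and return the parsed soup
    url = "http://www.pga.com/golf-courses/search?page={}&searchbox=Course+Name&searchbox_zip=ZIP&distance=50&price_range=0&course_type=both&has_events=0".format(page_number)
    r = requests.get(url)
    return BeautifulSoup(r.content, "html.parser")

all_rows = []
for page_number in range(1, 907):        # one call per results page
    soup = scrape_page(page_number)
    # ... extract the course fields from soup here and append them to all_rows ...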

Right after the search? in the URL, starting at page 2, a page=1 parameter appears and keeps increasing until page 907, where it is page=906.

I noticed that the first solution had a repetition of the first instance; that is because page 0 and page 1 are the same page. This is resolved by specifying the start page in the range function. Example below...

for i in range(1, 907):     # Number of pages plus one
    url = "http://www.pga.com/golf-courses/search?page={}&searchbox=Course+Name&searchbox_zip=ZIP&distance=50&price_range=0&course_type=both&has_events=0".format(i)
    r = requests.get(url)
    soup = BeautifulSoup(r.content, "html5lib")   # Can use whichever parser you prefer

    # Your code for each individual page here

I had this exact same problem and the solutions above did not work. I solved mine by accounting for cookies. A requests session helps: create a session and it will pull all the pages you need by carrying the cookie across all the numbered pages.

import csv
import requests 
from bs4 import BeautifulSoup
url = "http://www.pga.com/golf-courses/search?searchbox=Course+Name&searchbox_zip=ZIP&distance=50&price_range=0&course_type=both&has_events=0"

s = requests.Session()
r = s.get(url)
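
A minimal sketch of how that session could be combined with the page loop from the earlier answers (the page count and parser choice are assumptions carried over from above):

import requests
from bs4 import BeautifulSoup

s = requests.Session()
# The first request establishes the cookies that the numbered pages expect
s.get("http://www.pga.com/golf-courses/search?searchbox=Course+Name&searchbox_zip=ZIP&distance=50&price_range=0&course_type=both&has_events=0")

for i in range(1, 907):
    url = "http://www.pga.com/golf-courses/search?page={}&searchbox=Course+Name&searchbox_zip=ZIP&distance=50&price_range=0&course_type=both&has_events=0".format(i)
    r = s.get(url)                        # reuses the session cookies on every page
    soup = BeautifulSoup(r.content, "html.parser")
    # ... your per-page extraction here ...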

The PGA website has changed since this question was asked.

It seems they organize all courses by: State > City > Course

In light of this change and the popularity of this question, here's how I'd solve this problem today.

Step 1 - Import everything we'll need:

import time
import random
from gazpacho import Soup   # https://github.com/maxhumber/gazpacho
from tqdm import tqdm       # to keep track of progress

Step 2 - Scrape all the state URL endpoints:

URL = "https://www.pga.com"

def get_state_urls():
    soup = Soup.get(URL + "/play")
    a_tags = soup.find("ul", {"data-cy": "states"}, mode="first").find("a")
    state_urls = [URL + a.attrs['href'] for a in a_tags]
    return state_urls

state_urls = get_state_urls()

Step 3 - Write a function to scrape all the city links:

def get_state_cities(state_url):
    soup = Soup.get(state_url)
    a_tags = soup.find("ul", {"data-cy": "city-list"}).find("a")
    state_cities = [URL + a.attrs['href'] for a in a_tags]
    return state_cities

state_url = state_urls[0]
city_links = get_state_cities(state_url)

Step 4 - Write a function to scrape all of the courses:

def get_courses(city_link):
    soup = Soup.get(city_link)
    courses = soup.find("div", {"class": "MuiGrid-root MuiGrid-item MuiGrid-grid-xs-12 MuiGrid-grid-md-6"}, mode="all")
    return courses

city_link = city_links[0]
courses = get_courses(city_link)

Step 5 - Write a function to parse all the useful info about a course:


def parse_course(course):
    return {
        "name": course.find("h5", mode="first").text,
        "address": course.find("div", {'class': "jss332"}, mode="first").strip(),
        "url": course.find("a", mode="first").attrs["href"]
    }

course = courses[0]
parse_course(course)

Step 6 - Loop through everything and save:

all_courses = []
for state_url in tqdm(state_urls):
    city_links = get_state_cities(state_url)
    time.sleep(random.uniform(1, 10) / 10)
    for city_link in city_links:
        courses = get_courses(city_link)
        time.sleep(random.uniform(1, 10) / 10)
        for course in courses:
            info = parse_course(course)
            all_courses.append(info)
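
Since the original goal was a local CSV, a small sketch of writing the collected dictionaries out could go at the end; all_courses_pga.csv is just a placeholder name, and the tiny stand-in list below represents the list built by the loop above:

import csv

# Stand-in for the all_courses list built by the loop above
all_courses = [{"name": "Example GC", "address": "1 Fairway Dr, Austin, TX", "url": "/example"}]

with open("all_courses_pga.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["name", "address", "url"])
    writer.writeheader()
    writer.writerows(all_courses)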
