
How can I loop scraping data for multiple pages in a website using python and beautifulsoup4

I am trying to scrape data from the PGA.com website to get a table of all of the golf courses in the United States. In my CSV table I want to include the name of the golf course, address, ownership, website, and phone number. With this data I would like to geocode it, place it into a map, and have a local copy on my computer.

I utilized Python and Beautiful Soup 4 to extract my data. I have reached as far as extracting the data and importing it into a CSV, but I am now having a problem with scraping data from multiple pages on the PGA website. I want to extract ALL of the golf courses, but my script is limited to only one page; I want to loop it so that it captures all data for golf courses from all of the pages found on the PGA site. There are about 18000 golf courses and 900 pages to capture data from.

Attached below is my script. I need help creating code that will capture all of the data from the PGA website, not just one page but multiple, so that it will provide me with all of the data for golf courses in the United States.

Here is my script below:

import csv
import requests 
from bs4 import BeautifulSoup
url = "http://www.pga.com/golf-courses/search?searchbox=Course+Name&searchbox_zip=ZIP&distance=50&price_range=0&course_type=both&has_events=0"

r = requests.get(url)

soup = BeautifulSoup(r.content)

g_data1=soup.find_all("div",{"class":"views-field-nothing-1"})
g_data2=soup.find_all("div",{"class":"views-field-nothing"})

courses_list=[]

for item in g_data2:
     try:
          name=item.contents[1].find_all("div",{"class":"views-field-title"})[0].text
     except:
          name=''
     try:
          address1=item.contents[1].find_all("div",{"class":"views-field-address"})[0].text
     except:
          address1=''
     try:
          address2=item.contents[1].find_all("div",{"class":"views-field-city-state-zip"})[0].text
     except:
          address2=''
     try:
          website=item.contents[1].find_all("div",{"class":"views-field-website"})[0].text
     except:
          website=''   
     try:
          Phonenumber=item.contents[1].find_all("div",{"class":"views-field-work-phone"})[0].text
     except:
          Phonenumber=''      

     course=[name,address1,address2,website,Phonenumber]
     courses_list.append(course)

     with open ('filename5.csv','wb') as file:
          writer=csv.writer(file)
          for row in courses_list:
               writer.writerow(row)    

#for item in g_data1:
     #try:
          #print item.contents[1].find_all("div",{"class":"views-field-counter"})[0].text
     #except:
          #pass  
     #try:
          #print item.contents[1].find_all("div",{"class":"views-field-course-type"})[0].text
     #except:
          #pass

#for item in g_data2:
   #try:
      #print item.contents[1].find_all("div",{"class":"views-field-title"})[0].text
   #except:
      #pass
   #try:
      #print item.contents[1].find_all("div",{"class":"views-field-address"})[0].text
   #except:
      #pass
   #try:
      #print item.contents[1].find_all("div",{"class":"views-field-city-state-zip"})[0].text
   #except:
      #pass

This script only captures 20 at a time, and I want to capture everything in one script, which accounts for the 18000 golf courses and the 900 pages to scrape from.

The PGA website's search has multiple pages, and the URL follows this pattern:

http://www.pga.com/golf-courses/search?page=1 # Additional info after page parameter here

This means you can read the content of a page, then increase the value of page by 1 and read the next page... and so on.

import csv
import requests 
from bs4 import BeautifulSoup
for i in range(907):      # Number of pages plus one 
    url = "http://www.pga.com/golf-courses/search?page={}&searchbox=Course+Name&searchbox_zip=ZIP&distance=50&price_range=0&course_type=both&has_events=0".format(i)
    r = requests.get(url)
    soup = BeautifulSoup(r.content, "html.parser")   # explicit parser avoids bs4's warning

    # Your code for each individual page here 
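For instance, the per-page parsing from the question can be plugged in where that comment sits. A minimal sketch, assuming the markup from the question's script (the "views-field-*" classes) is unchanged, and writing the CSV once at the end rather than on every row:

import csv
import requests
from bs4 import BeautifulSoup

def field_text(item, css_class):
    # Text of the first matching div inside item, or '' if absent.
    tag = item.find("div", {"class": css_class})
    return tag.get_text(strip=True) if tag else ''

courses_list = []
for i in range(907):  # pages are numbered 0..906
    url = ("http://www.pga.com/golf-courses/search?page={}"
           "&searchbox=Course+Name&searchbox_zip=ZIP&distance=50"
           "&price_range=0&course_type=both&has_events=0").format(i)
    r = requests.get(url)
    soup = BeautifulSoup(r.content, "html.parser")
    for item in soup.find_all("div", {"class": "views-field-nothing"}):
        courses_list.append([
            field_text(item, "views-field-title"),
            field_text(item, "views-field-address"),
            field_text(item, "views-field-city-state-zip"),
            field_text(item, "views-field-website"),
            field_text(item, "views-field-work-phone"),
        ])

# Write the CSV once, after all pages have been scraped.
with open("courses.csv", "w", newline="") as f:
    csv.writer(f).writerows(courses_list)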

If you are still reading this post, you can try this code too...

from urllib.request import urlopen
from bs4 import BeautifulSoup

file = "Details.csv"
f = open(file, "w")
Headers = "Name,Address,City,Phone,Website\n"
f.write(Headers)
for page in range(1,5):
    url = "http://www.pga.com/golf-courses/search?page={}&searchbox=Course%20Name&searchbox_zip=ZIP&distance=50&price_range=0&course_type=both&has_events=0".format(page)
    html = urlopen(url)
    soup = BeautifulSoup(html,"html.parser")
    Title = soup.find_all("div", {"class":"views-field-nothing"})
    for i in Title:
        try:
            name = i.find("div", {"class":"views-field-title"}).get_text()
            address = i.find("div", {"class":"views-field-address"}).get_text()
            city = i.find("div", {"class":"views-field-city-state-zip"}).get_text()
            phone = i.find("div", {"class":"views-field-work-phone"}).get_text()
            website = i.find("div", {"class":"views-field-website"}).get_text()
            print(name, address, city, phone, website)
            f.write("{}".format(name).replace(",","|")+ ",{}".format(address)+ ",{}".format(city).replace(",", " ")+ ",{}".format(phone) + ",{}".format(website) + "\n")
        except AttributeError:
            pass
f.close()

Where it says range(1,5), just change that to 0 through the last page, and you will get all the details in the CSV. I tried very hard to get your data in the proper format, but it's hard:).
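As a side note, the comma juggling in the f.write line above (the .replace(",", "|") and .replace(",", " ") workarounds) is exactly what the csv module handles automatically. A minimal sketch; the values are placeholders:

import csv

with open("Details.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["Name", "Address", "City", "Phone", "Website"])
    # Fields containing commas are quoted automatically, so no
    # manual replace() workarounds are needed.
    writer.writerow(["Some Course", "123 Main St", "Anytown, ST 00000",
                     "555-0100", "www.example.com"])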

You are pointing your link at a single page; it isn't going to iterate through each one on its own.

Page 1:

url = "http://www.pga.com/golf-courses/search?searchbox=Course+Name&searchbox_zip=ZIP&distance=50&price_range=0&course_type=both&has_events=0"

Page 2:

http://www.pga.com/golf-courses/search?page=1&searchbox=Course%20Name&searchbox_zip=ZIP&distance=50&price_range=0&course_type=both&has_events=0

Page 907: http://www.pga.com/golf-courses/search?page=906&searchbox=Course%20Name&searchbox_zip=ZIP&distance=50&price_range=0&course_type=both&has_events=0

Since you are running for page 1, you will only get 20 results. You will need to create a loop that runs through each page.

You can start by creating a function that does one page, then iterate that function (see the sketch below).

Right after search? in the URL, page=1 begins at the second page and keeps increasing until page=906 on page 907.
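A minimal sketch of that function-per-page idea. The function name and structure are my own, and it assumes the selectors from the question still apply:

import requests
from bs4 import BeautifulSoup

BASE = ("http://www.pga.com/golf-courses/search?page={}&searchbox=Course+Name"
        "&searchbox_zip=ZIP&distance=50&price_range=0&course_type=both&has_events=0")

def scrape_page(page_number):
    # Fetch and parse one results page; page=0 is the site's page 1.
    r = requests.get(BASE.format(page_number))
    soup = BeautifulSoup(r.content, "html.parser")
    return soup.find_all("div", {"class": "views-field-nothing"})

all_items = []
for page_number in range(907):
    all_items.extend(scrape_page(page_number))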

I noticed that the first solution repeated the first batch of results; that is because page 0 and page 1 are the same page. This is resolved by specifying the start page in the range function. Example below...

for i in range(1, 907):     # Number of pages plus one
    url = "http://www.pga.com/golf-courses/search?page={}&searchbox=Course+Name&searchbox_zip=ZIP&distance=50&price_range=0&course_type=both&has_events=0".format(i)
    r = requests.get(url)
    soup = BeautifulSoup(r.content, "html5lib")   # Can use whichever parser you prefer

    # Your code for each individual page here

Had the same problem, and the solutions above did not work. I solved mine by accounting for cookies. A requests session helps. Create a session, and it will pull all the pages you need by feeding the cookie to all the numbered pages.

import csv
import requests 
from bs4 import BeautifulSoup
url = "http://www.pga.com/golf-courses/search?searchbox=Course+Name&searchbox_zip=ZIP&distance=50&price_range=0&course_type=both&has_events=0"

s = requests.Session()
r = s.get(url)
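The answer stops here; presumably the numbered pages are then fetched with the same session so the cookie set by the first request is reused. A sketch of how that might continue (my own continuation, not the answerer's code):

for i in range(1, 907):
    page_url = "http://www.pga.com/golf-courses/search?page={}&searchbox=Course+Name&searchbox_zip=ZIP&distance=50&price_range=0&course_type=both&has_events=0".format(i)
    r = s.get(page_url)            # same session, so the cookie from above is sent
    soup = BeautifulSoup(r.content, "html.parser")
    # ... parse each page as before ...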

The PGA website has changed since this question was asked.

It seems they now organize all of the courses by: State > City > Course.

Given this change and the popularity of this question, here's how I'd solve the problem today.

Step 1 - Import everything we'll need:

import time
import random
from gazpacho import Soup   # https://github.com/maxhumber/gazpacho
from tqdm import tqdm       # to keep track of progress

Step 2 - Scrape all the state URL endpoints:

URL = "https://www.pga.com"

def get_state_urls():
    soup = Soup.get(URL + "/play")
    a_tags = soup.find("ul", {"data-cy": "states"}, mode="first").find("a")
    state_urls = [URL + a.attrs['href'] for a in a_tags]
    return state_urls

state_urls = get_state_urls()

Step 3 - Write a function to scrape all the city links:

def get_state_cities(state_url):
    soup = Soup.get(state_url)
    a_tags = soup.find("ul", {"data-cy": "city-list"}).find("a")
    state_cities = [URL + a.attrs['href'] for a in a_tags]
    return state_cities

state_url = state_urls[0]
city_links = get_state_cities(state_url)

Step 4 - Write a function to scrape all the courses:

def get_courses(city_link):
    soup = Soup.get(city_link)
    courses = soup.find("div", {"class": "MuiGrid-root MuiGrid-item MuiGrid-grid-xs-12 MuiGrid-grid-md-6"}, mode="all")
    return courses

city_link = city_links[0]
courses = get_courses(city_link)

Step 5 - Write a function to parse all the useful info about a course:


def parse_course(course):
    return {
        "name": course.find("h5", mode="first").text,
        "address": course.find("div", {'class': "jss332"}, mode="first").strip(),
        "url": course.find("a", mode="first").attrs["href"]
    }

course = courses[0]
parse_course(course)

Step 6 - Loop through everything and save:

all_courses = []
for state_url in tqdm(state_urls):
    city_links = get_state_cities(state_url)
    time.sleep(random.uniform(1, 10) / 10)
    for city_link in city_links:
        courses = get_courses(city_link)
        time.sleep(random.uniform(1, 10) / 10)
        for course in courses:
            info = parse_course(course)
            all_courses.append(info)
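The original answer ends with the collection loop; a minimal way to save all_courses afterwards (my addition, using the standard csv module and the keys produced by parse_course):

import csv

with open("courses.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["name", "address", "url"])
    writer.writeheader()
    writer.writerows(all_courses)   # each item is a dict from parse_course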

