Loop not working when scraping data with Python and BeautifulSoup4

Question

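My goal is to collect data from the PGA website in order to extract the locations of all golf courses in the United States. I want to scrape the name, address, ownership, phone number, and website from 907 pages.

I have created the script below, but something goes wrong when it creates the CSV. The CSV file the script produces contains repeated data from the first few pages and from the website pages. It does not give me the full data from all 907 pages.

How can I fix my script so that it scrapes all 907 pages and produces a CSV of every golf course listed on the PGA website?

Below is my script: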

import csv
import requests 
from bs4 import BeautifulSoup

for i in range(907):      # Number of pages plus one 
     url = "http://www.pga.com/golf-courses/search?page={}&searchbox=Course+Name&searchbox_zip=ZIP&distance=50&price_range=0&course_type=both&has_events=0".format(i)
     r = requests.get(url)
     soup = BeautifulSoup(r.content)
g_data2=soup.find_all("div",{"class":"views-field-nothing"})

courses_list=[]

for item in g_data2:
     try:
          name=item.contents[1].find_all("div",{"class":"views-field-title"})[0].text
     except:
          name=''
     try:
          address1=item.contents[1].find_all("div",{"class":"views-field-address"})[0].text
     except:
          address1=''
     try:
          address2=item.contents[1].find_all("div",{"class":"views-field-city-state-zip"})[0].text
     except:
          address2=''
     try:
          website=item.contents[1].find_all("div",{"class":"views-field-website"})[0].text
     except:
          website=''   
     try:
          Phonenumber=item.contents[1].find_all("div",{"class":"views-field-work-phone"})[0].text
     except:
          Phonenumber=''      

     course=[name,address1,address2,website,Phonenumber]
     courses_list.append(course)

     with open ('PGA_Data.csv','a') as file:
          writer=csv.writer(file)
          for row in courses_list:
               writer.writerow(row)

Answer 1

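Here is the code you want. It parses the current page first, before moving on to the next one. (There are some blank lines, which I hope you can fix yourself.)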

import csv
import requests 
from bs4 import BeautifulSoup


def encode(l):
    # Replace every non-ASCII character with a space.
    # Written for Python 3, where str is already Unicode, so no .encode() call is needed.
    out = []
    for i in l:
        text = str(i)
        out.append(''.join([c if ord(c) < 128 else ' ' for c in text])) # taken from Martijn Pieters' answer
        # http://stackoverflow.com/questions/20078816/replace-non-ascii-characters-with-a-single-space/20078869#20078869
    return out

courses_list = []
for i in range(5):      # pages 0-4 only; raise the bound to cover every page
    url = "http://www.pga.com/golf-courses/search?page={}&searchbox=Course+Name&searchbox_zip=ZIP&distance=50&price_range=0&course_type=both&has_events=0".format(i)
    r = requests.get(url)
    soup = BeautifulSoup(r.content)

    g_data2=soup.find_all("div",{"class":"views-field-nothing"})

    for item in g_data2:
        try:
              name = item.contents[1].find_all("div",{"class":"views-field-title"})[0].text

        except:
              name=''
        try:
              address1= item.contents[1].find_all("div",{"class":"views-field-address"})[0].text
        except:
              address1=''
        try:
              address2= item.contents[1].find_all("div",{"class":"views-field-city-state-zip"})[0].text
        except:
              address2=''
        try:
              website= item.contents[1].find_all("div",{"class":"views-field-website"})[0].text
        except:
              website=''   
        try:
              Phonenumber= item.contents[1].find_all("div",{"class":"views-field-work-phone"})[0].text
        except:
              Phonenumber=''      

        course=[name,address1,address2,website,Phonenumber]

        courses_list.append(encode(course))


with open('PGA_Data.csv', 'a') as file:
    writer = csv.writer(file)
    for row in courses_list:
        writer.writerow(row)
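
To make the difference in loop structure concrete, here is a minimal offline sketch rather than the real scraper. `fetch_page` is a hypothetical stand-in for the request-and-parse step and returns made-up rows, and the other function names are also just for illustration. It contrasts an item loop placed after the page loop (as in the question) with one nested inside it, and it builds the CSV once, after all rows are collected.

```python
import csv
import io


def fetch_page(page):
    """Hypothetical stand-in for requests.get + BeautifulSoup parsing.

    Returns three made-up rows per page so the sketch runs offline.
    """
    return [["Course {}-{}".format(page, n), "Address {}".format(n)]
            for n in range(3)]


def scrape_outside_loop(pages):
    """Mirrors the question: the item loop sits after the page loop."""
    for i in range(pages):
        rows_on_page = fetch_page(i)   # overwritten on every iteration
    collected = []
    for row in rows_on_page:           # only ever sees the last page
        collected.append(row)
    return collected


def scrape_inside_loop(pages):
    """Mirrors the answer: each page is processed before the next fetch."""
    collected = []
    for i in range(pages):
        for row in fetch_page(i):
            collected.append(row)
    return collected


def to_csv_text(rows):
    """Write all rows once, after scraping, so no row is repeated."""
    buffer = io.StringIO()
    csv.writer(buffer).writerows(rows)
    return buffer.getvalue()


broken = scrape_outside_loop(4)
fixed = scrape_inside_loop(4)
print(len(broken), len(fixed))  # 3 rows versus 12 rows
```

Writing the file inside the item loop in append mode, as the question's script does, rewrites the whole list on every iteration, which would explain the repeated rows.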

Edit: after the inevitable Unicode encoding/decoding problems, I have modified my answer, and it will (hopefully) work now. But see http://nedbatchelder.com/text/unipain.html.
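
For readers on Python 3, where `str` is already Unicode, here is a rough sketch of two ways to handle non-ASCII text in the scraped fields. The names `ascii_only` and `write_rows_utf8` and the sample row are made up for illustration. The first replaces non-ASCII characters with a placeholder, and the second keeps them by writing the CSV as UTF-8.

```python
import csv
import os
import tempfile


def ascii_only(text, placeholder=" "):
    """Replace every non-ASCII character in text with placeholder."""
    return "".join(ch if ord(ch) < 128 else placeholder for ch in text)


def write_rows_utf8(path, rows):
    """Write rows to a CSV file as UTF-8, keeping non-ASCII characters."""
    # newline="" lets the csv module control line endings itself.
    with open(path, "w", newline="", encoding="utf-8") as handle:
        csv.writer(handle).writerows(rows)


sample_row = ["Café Golf Club", "1 Main St"]   # made-up data

# Option 1: strip to ASCII before writing.
print([ascii_only(field) for field in sample_row])

# Option 2: keep the characters and write UTF-8 instead.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.csv")
write_rows_utf8(demo_path, [sample_row])
with open(demo_path, encoding="utf-8") as handle:
    print(handle.read().strip())
```

Which option fits depends on what will read the CSV afterwards. Stripping loses information, while UTF-8 needs a reader that expects that encoding.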
