
Loop not working for scraping data using python and beautifulsoup4

My goal is to scrape data from the PGA website to extract all the golf course locations in the USA. I aim to scrape the name, address, ownership, phone number, and website from each of the 907 result pages.

I have created the script below, but the CSV it produces is wrong: it contains repeated data from the first few pages of the website, and it does not contain the data from all 907 pages.

How can I fix my script so that it scrapes all 907 pages and produces a CSV listing every golf course on the PGA website?

Below is my script:

import csv
import requests 
from bs4 import BeautifulSoup

for i in range(907):      # Number of pages plus one 
     url = "http://www.pga.com/golf-courses/search?page={}&searchbox=Course+Name&searchbox_zip=ZIP&distance=50&price_range=0&course_type=both&has_events=0".format(i)
     r = requests.get(url)
     soup = BeautifulSoup(r.content)
g_data2=soup.find_all("div",{"class":"views-field-nothing"})

courses_list=[]

for item in g_data2:
     try:
          name=item.contents[1].find_all("div",{"class":"views-field-title"})[0].text
     except:
          name=''
     try:
          address1=item.contents[1].find_all("div",{"class":"views-field-address"})[0].text
     except:
          address1=''
     try:
          address2=item.contents[1].find_all("div",{"class":"views-field-city-state-zip"})[0].text
     except:
          address2=''
     try:
          website=item.contents[1].find_all("div",{"class":"views-field-website"})[0].text
     except:
          website=''   
     try:
          Phonenumber=item.contents[1].find_all("div",{"class":"views-field-work-phone"})[0].text
     except:
          Phonenumber=''      

     course=[name,address1,address2,website,Phonenumber]
     courses_list.append(course)

     with open ('PGA_Data.csv','a') as file:
          writer=csv.writer(file)
          for row in courses_list:
               writer.writerow(row)

Here is the code that you want. Your version fetches every page but only parses the last one, because the parsing block sits outside the request loop, and it re-writes the accumulated rows to the CSV on every item, which is why you see repetitions. The version below parses each page before going on to the next one. (There are some blank rows; I hope you can fix that yourself, see the sketch after the code.)

import csv
import requests 
from bs4 import BeautifulSoup


def encode(l):
    # Replace non-ASCII characters with a single space so Python 2's csv
    # module, which only handles byte strings, can write the rows.
    # Adapted from Martijn Pieters's answer:
    # http://stackoverflow.com/questions/20078816/replace-non-ascii-characters-with-a-single-space/20078869#20078869
    out = []
    for i in l:
        text = i.encode('utf-8')
        out.append(''.join([c if ord(c) < 128 else ' ' for c in text]))
    return out

courses_list = []
for i in range(5):      # first 5 pages for testing; use range(907) for the full run
    url = "http://www.pga.com/golf-courses/search?page={}&searchbox=Course+Name&searchbox_zip=ZIP&distance=50&price_range=0&course_type=both&has_events=0".format(i)
    r = requests.get(url)
    soup = BeautifulSoup(r.content, "html.parser")  # name the parser explicitly

    g_data2 = soup.find_all("div", {"class": "views-field-nothing"})

    for item in g_data2:
        try:
            name = item.contents[1].find_all("div", {"class": "views-field-title"})[0].text
        except:
            name = ''
        try:
            address1 = item.contents[1].find_all("div", {"class": "views-field-address"})[0].text
        except:
            address1 = ''
        try:
            address2 = item.contents[1].find_all("div", {"class": "views-field-city-state-zip"})[0].text
        except:
            address2 = ''
        try:
            website = item.contents[1].find_all("div", {"class": "views-field-website"})[0].text
        except:
            website = ''
        try:
            Phonenumber = item.contents[1].find_all("div", {"class": "views-field-work-phone"})[0].text
        except:
            Phonenumber = ''

        course = [name, address1, address2, website, Phonenumber]
        courses_list.append(encode(course))


with open('PGA_Data.csv', 'a') as file:    # 'a' appends, so re-running adds duplicate rows
    writer = csv.writer(file)
    for row in courses_list:
        writer.writerow(row)
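The blank rows mentioned above come from result divs where every field lookup failed, leaving a row of five empty strings. A minimal guard, under the loop above, is to skip such rows before appending; the any() check is my addition, not part of the original answer:

        # Keep the row only if at least one field is non-empty
        if any(field.strip() for field in course):
            courses_list.append(encode(course))

This replaces the plain courses_list.append(encode(course)) line inside the item loop.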

EDIT: After the inevitable problems of unicode encoding/decoding, I have modified the answer and it will (hopefully) work now. But see http://nedbatchelder.com/text/unipain.html for why this is painful.
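On Python 3 the encode() workaround is unnecessary, because the csv module writes unicode text directly. Here is a minimal sketch of the same scrape under that assumption (untested against the live site; it uses item.find in place of item.contents[1].find_all, which should be equivalent for this markup):

import csv
import requests
from bs4 import BeautifulSoup

FIELD_CLASSES = ["views-field-title", "views-field-address",
                 "views-field-city-state-zip", "views-field-website",
                 "views-field-work-phone"]

with open('PGA_Data.csv', 'w', newline='') as f:   # newline='' is the csv idiom on Python 3
    writer = csv.writer(f)
    for i in range(907):   # pages are numbered 0-906
        url = ("http://www.pga.com/golf-courses/search?page={}"
               "&searchbox=Course+Name&searchbox_zip=ZIP&distance=50"
               "&price_range=0&course_type=both&has_events=0").format(i)
        soup = BeautifulSoup(requests.get(url).content, "html.parser")
        for item in soup.find_all("div", {"class": "views-field-nothing"}):
            # Pull the text of each field's div, or '' if it is missing
            row = []
            for cls in FIELD_CLASSES:
                div = item.find("div", {"class": cls})
                row.append(div.text.strip() if div else '')
            if any(row):   # drop entirely blank rows
                writer.writerow(row)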
