
For Loop to pass a Variable through a URL in Python

I am very new to Python and I am trying to learn on my own by doing some simple web scraping to get football stats.

I have been successful in getting the data for a single page at a time, but I have not been able to figure out how to add a loop into my code to scrape multiple pages at once (or multiple positions/years/conferences for that matter).

I have searched a fair amount on this and other websites but I can't seem to get it right.

Here's my code:

import csv
import requests
from BeautifulSoup import BeautifulSoup

url = 'http://www.nfl.com/stats/categorystats?seasonType=REG&d-447263-n=1&d-447263-o=2&d-447263-p=1&d-447263-s=PASSING_YARDS&tabSeq=0&season=2014&Submit=Go&experience=&archive=false&statisticCategory=PASSING&conference=null&qualified=false'
response = requests.get(url)
html = response.content

soup = BeautifulSoup(html)
table = soup.find('table', attrs={'class': 'data-table1'})

list_of_rows = []
for row in table.findAll('tr'):
    list_of_cells = []
    for cell in row.findAll('td'):
        text = cell.text.replace('&#39', '')
        list_of_cells.append(text)
    list_of_rows.append(list_of_cells)

#for line in list_of_rows: print ', '.join(line)

outfile = open("./2014.csv", "wb")
writer = csv.writer(outfile)
writer.writerow(["Rk", "Player", "Team", "Pos", "Comp", "Att", "Pct", "Att/G", "Yds", "Avg", "Yds/G", "TD", "Int", "1st", "1st%", "Lng", "20+", "40+", "Sck", "Rate"])
writer.writerows(list_of_rows)

outfile.close()

Here's my attempt at adding a variable into the URL and building a loop:

import csv
import requests
from BeautifulSoup import BeautifulSoup

pagelist = ["1", "2", "3"]

x = 0
while (x < 500):
    url = "http://www.nfl.com/stats/categorystats?seasonType=REG&d-447263-n=1&d-447263-o=2&d-447263-p="+str(x)).read(),'html'+"&d-447263-s=RUSHING_ATTEMPTS_PER_GAME_AVG&tabSeq=0&season=2014&Submit=Go&experience=&archive=false&statisticCategory=RUSHING&conference=null&qualified=false"

    response = requests.get(url)
    html = response.content
    soup = BeautifulSoup(html)
    table = soup.find('table', attrs={'class': 'data-table1'})
    list_of_rows = []
    for row in table.findAll('tr'):
        list_of_cells = []
        for cell in row.findAll('td'):
            text = cell.text.replace('&#39', '')
            list_of_cells.append(text)
    list_of_rows.append(list_of_cells)

    #for line in list_of_rows: print ', '.join(line)


    outfile = open("./2014.csv", "wb")
    writer = csv.writer(outfile)
    writer.writerow(["Rk", "Player", "Team", "Pos", "Att", "Att/G", "Yds", "Avg", "Yds/G", "TD", "Long", "1st", "1st%", "20+", "40+", "FUM"])
    writer.writerows(list_of_rows)
    x = x + 0
    outfile.close()

Thanks much in advance.

Here's my revised code, which seems to be deleting each previous page as it writes to the csv file.

import csv
import requests
from BeautifulSoup import BeautifulSoup

url_template = 'http://www.nfl.com/stats/categorystats?tabSeq=0&season=2014&seasonType=REG&experience=&Submit=Go&archive=false&d-447263-p=%s&conference=null&statisticCategory=PASSING&qualified=false'

for p in ['1','2','3']:
    url = url_template % p
    response = requests.get(url)
    html = response.content
    soup = BeautifulSoup(html)
    table = soup.find('table', attrs={'class': 'data-table1'})

    list_of_rows = []
    for row in table.findAll('tr'):
        list_of_cells = []
        for cell in row.findAll('td'):
            text = cell.text.replace('&#39', '')
            list_of_cells.append(text)
        list_of_rows.append(list_of_cells)

    #for line in list_of_rows: print ', '.join(line)

    outfile = open("./2014Passing.csv", "wb")
    writer = csv.writer(outfile)
    writer.writerow(["Rk", "Player", "Team", "Pos", "Comp", "Att", "Pct", "Att/G", "Yds", "Avg", "Yds/G", "TD", "Int", "1st", "1st%", "Lng", "20+", "40+", "Sck", "Rate"])
    writer.writerows(list_of_rows)

outfile.close()

Assuming that you just want to change the page number, you could use string formatting, something like this:

url_template = 'http://www.nfl.com/stats/categorystats?seasonType=REG&d-447263-n=1&d-447263-o=2&d-447263-p=%s&d-447263-s=PASSING_YARDS&tabSeq=0&season=2014&Submit=Go&experience=&archive=false&statisticCategory=PASSING&conference=null&qualified=false'
for page in [1, 2, 3]:
    url = url_template % page
    response = requests.get(url)
    # Rest of the processing code can go here
    outfile = open("./2014.csv", "ab")
    writer = csv.writer(outfile)
    writer.writerow(...)  # with append mode, write the header once before the loop, or it will repeat on every page
    writer.writerows(list_of_rows)
    outfile.close()

Note that you should open the file in append mode ("ab") rather than write mode ("wb"): write mode truncates and overwrites the existing contents on every open, which is exactly the behavior you've been seeing. In append mode, new contents are written at the end of the file.
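The difference can be shown with a minimal, self-contained sketch (written in Python 3 style, where csv files are opened in text mode with newline=''; "wb"/"ab" are the Python 2 equivalents):

```python
import csv
import os

path = "demo.csv"

# Write mode: each open() truncates the file, so only the last
# writerows() survives -- this is the "deleting each page" symptom.
for page in [1, 2, 3]:
    with open(path, "w", newline="") as f:
        csv.writer(f).writerows([["page", page]])

with open(path, newline="") as f:
    print(sum(1 for _ in f))  # 1 -- only page 3 remains

# Append mode: write the header once, then append each page's rows.
with open(path, "w", newline="") as f:
    csv.writer(f).writerow(["col", "page"])

for page in [1, 2, 3]:
    with open(path, "a", newline="") as f:
        csv.writer(f).writerows([["row", page]])

with open(path, newline="") as f:
    print(sum(1 for _ in f))  # 4 -- header plus one row per page

os.remove(path)
```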

This is outside the scope of the question and more of a friendly code-improvement suggestion, but the script would be easier to reason about if you split it into smaller functions that each do one thing, e.g. fetching a page from the site, parsing the table, writing the rows to csv.
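For example, the loop could be organized along these lines (a sketch only — the fetch_page and parse_rows names here are hypothetical stand-ins for the requests.get and BeautifulSoup logic from the original script, passed in as callables):

```python
import csv


def build_url(template, page):
    """Fill the page number into the URL template."""
    return template % page


def write_header(path, header):
    """Create/overwrite the csv file with just the header row."""
    with open(path, "w", newline="") as f:
        csv.writer(f).writerow(header)


def append_rows(path, rows):
    """Append one page's worth of rows to the csv file."""
    with open(path, "a", newline="") as f:
        csv.writer(f).writerows(rows)


def scrape(template, pages, path, header, fetch_page, parse_rows):
    """Write the header once, then append the parsed rows of each page."""
    write_header(path, header)
    for page in pages:
        html = fetch_page(build_url(template, page))
        append_rows(path, parse_rows(html))
```

Each piece can then be tested on its own, and the header is guaranteed to be written exactly once no matter how many pages are scraped.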
