
Only writing first row to csv

I am trying to scrape a page. I can get it to pull all the data and save it to array objects, but I cannot get my for loop to iterate over every index of the arrays and output those rows to CSV. It only writes the headers and the first object. I am new to writing code, so any help is appreciated.

from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup

my_url = 'https://www.sports-reference.com/cfb/schools/air-force/'

# Open Connection & Grabbing the Page
uClient = uReq(my_url)

#Creating variable to Save the Page
page_html = uClient.read()

#Closing the connection
uClient.close()

#Parse the data to HTML
page_soup = soup(page_html, "html.parser")

#Grab container info from the DOM
containers = page_soup.findAll("div",{"class":"overthrow table_container"})

filename = "airforce.csv"
f = open(filename, "w")

headers = "year, wins, losses, ties, wl, sos\n"

f.write(headers)

for container in containers:
 #Find all years
 year_container = container.findAll("td",{"data-stat":"year_id"})
 year = year_container[0].text

 #Find number of Wins
 wins_container = container.findAll("td",{"data-stat":"wins"})
 wins = wins_container[0].text

 #Find number of Losses
 losses_container = container.findAll("td",{"data-stat":"losses"})
 losses = losses_container[0].text

 #Number of Ties if any
 ties_container = container.findAll("td",{"data-stat":"ties"})
 ties = ties_container[0].text

 #Win-Loss as a percentage
 wl_container = container.findAll("td",{"data-stat":"win_loss_pct"})
 wl = wl_container[0].text


 #Strength of Schedule. Can be +/- with 0 being average
 sos_container = container.findAll("td",{"data-stat":"sos"})
 sos = sos_container[0].text

 f.write(year + "," + wins + "," + losses + "," + ties + "," + wl + "," + sos + "\n")

f.close()

You want to find the table (body) and then iterate over the table rows that are not header rows, i.e. all rows that don't have a class.

For writing (and reading) CSV files there is the csv module in the standard library.

import csv
from urllib.request import urlopen

import bs4


def iter_rows(html):
    headers = ['year_id', 'wins', 'losses', 'ties', 'win_loss_pct', 'sos']
    yield headers

    soup = bs4.BeautifulSoup(html, 'html.parser')
    table_body_node = soup.find('table', 'stats_table').tbody
    for row_node in table_body_node('tr'):
        if not row_node.get('class'):
            yield [
                row_node.find('td', {'data-stat': header}).text
                for header in headers
            ]


def main():
    url = 'https://www.sports-reference.com/cfb/schools/air-force/'
    with urlopen(url) as response:
        html = response.read()

    # newline='' prevents blank lines between rows on Windows when using csv.writer
    with open('airforce.csv', 'w', newline='') as csv_file:
        csv.writer(csv_file).writerows(iter_rows(html))


if __name__ == '__main__':
    main()

Pulling up the HTML source code, there is only one container that gets put into your containers list. This means your for loop runs only once and is trying to access the wrong information.

You should use range() to index into the different td elements that reside inside that single item in your containers list.

Try this:

#number of records to iterate over
num = len(containers[0].findAll("td",{"data-stat":"year_id"}))

for i in range(num):
    #Find all years
    year_container = containers[0].findAll("td",{"data-stat":"year_id"})
    year = year_container[i].text
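
Here is a minimal sketch of how the rest of that loop could look under this approach, assuming containers[0] is the single table container found above and that the file handle f from the question's code is still open; it collects the cells for each data-stat column once and then writes one CSV row per index:

container = containers[0]

#Grab the matching cells for every column once, then index by row
years  = container.findAll("td", {"data-stat": "year_id"})
wins   = container.findAll("td", {"data-stat": "wins"})
losses = container.findAll("td", {"data-stat": "losses"})
ties   = container.findAll("td", {"data-stat": "ties"})
wl     = container.findAll("td", {"data-stat": "win_loss_pct"})
sos    = container.findAll("td", {"data-stat": "sos"})

#One CSV row per index, i.e. per season
for i in range(len(years)):
    f.write(years[i].text + "," + wins[i].text + "," + losses[i].text + ","
            + ties[i].text + "," + wl[i].text + "," + sos[i].text + "\n")

This writes one line per season instead of one line per container, which is what the original loop was missing.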
