Where am I going wrong with this scraping?

It should be really simple, but I'm struggling to pull out each row from this NCAA table (e.g. Florida State, ACC, 22-1-2, etc.).

I guess my main question here is: where do I start? What am I looking for? Do I search for the 'div' tag, the 'tbody' tag, or the 'tr' tag? Whichever one I try, whether with find_all, find, or select with a CSS selector, returns nothing.

https://www.ncaa.com/rankings/soccer-women/d1/ncaa-womens-soccer-rpi

Edit: Managed to get it, see below:

from bs4 import BeautifulSoup
import requests
import csv

url = 'https://www.ncaa.com/rankings/soccer-women/d1/ncaa-womens-soccer-rpi'

result = requests.get(url)

soup = BeautifulSoup(result.text,'html.parser')

check = soup.find_all('tr')

names_lst = []
conference_lst = []
record_lst = []


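# Skip the header row (index 0); it holds the column titles, not data cells
for info in check[1:]:
    details = info.find_all('td')
    # Cell 0 is the rank; cells 1-3 are school, conference and record
    names = details[1].text.strip()
    conference = details[2].text.strip()
    record = details[3].text.strip()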

    names_lst.append(names)
    conference_lst.append(conference)
    record_lst.append(record)

print(names_lst)
print(conference_lst)
print(record_lst)

# newline='' stops the csv module from writing blank lines between rows on Windows
with open('ncaa_rankings.csv', 'w', newline='') as ncaa_file:
    csv_writer = csv.writer(ncaa_file)
    for names, conference, record in zip(names_lst, conference_lst, record_lst):
        csv_writer.writerow([names, conference, record])
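
A possible simplification of the CSV step, sketched under the assumption that each table row has already been reduced to a (school, conference, record) tuple: the three parallel lists could be replaced by a single list of rows and written with one writerows call. The sample rows are copied from the first entries of the rankings table; the header names and the write_rankings helper are hypothetical and only serve the illustration.

```python
import csv
import io

# Sample rows taken from the top of the rankings table; in the real script
# these would come from the loop over the <tr> elements.
rows = [
    ("Florida St.", "ACC", "22-1-2"),
    ("Duke", "ACC", "16-4-1"),
    ("Arkansas", "SEC", "19-4-1"),
]


def write_rankings(rows, out_file):
    """Write a header line, then one CSV line per (school, conference, record) row."""
    writer = csv.writer(out_file)
    writer.writerow(["School", "Conference", "Record"])  # hypothetical header names
    writer.writerows(rows)


# An in-memory buffer keeps the sketch free of side effects.
buffer = io.StringIO()
write_rankings(rows, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

With a real file, the same helper could be called inside `with open('ncaa_rankings.csv', 'w', newline='') as f:` as `write_rankings(rows, f)`.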

This problem is solvable with 5 lines of code:

import pandas as pd

url = "https://www.ncaa.com/rankings/soccer-women/d1/ncaa-womens-soccer-rpi"
df = pd.read_html(url)[0]
df.to_csv("w_soccer_rpi.csv")
print(df)

Result (also saved in a CSV file):

     Rank  School              Conference      Record  Road    Neutral  Home    Non Div I
0    1     Florida St.         ACC             22-1-2  6-1-1   4-0-0    12-0-1  0-0-0
1    2     Duke                ACC             16-4-1  4-1-1   0-0-0    12-3-0  0-0-0
2    3     Arkansas            SEC             19-4-1  4-3-1   4-1-0    11-0-0  0-0-0
3    4     Rutgers             Big Ten         19-4-2  6-1-0   0-1-0    13-2-2  0-0-0
4    5     Michigan            Big Ten         18-4-3  5-3-2   1-0-0    12-1-1  0-0-0
...  ...   ...                 ...             ...     ...     ...      ...     ...
337  338   Nicholls            Southland       0-18-0  0-10-0  0-2-0    0-6-0   0-0-0
338  339   Delaware St.        DI Independent  2-11-1  1-6-0   0-0-0    1-5-1   1-0-0
339  340   Mississippi Val.    SWAC            0-13-0  0-7-0   0-1-0    0-5-0   0-0-0
340  341   Hampton             Big South       1-13-1  0-8-0   0-0-0    1-5-1   0-0-0
341  342   South Carolina St.  DI Independent  0-10-1  0-4-1   0-0-0    0-6-0   2-1-0

Relevant pandas documentation: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_html.html
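
One detail about the saved file may be worth noting: the unnamed leftmost column in the printed result is the DataFrame index, and to_csv writes it by default. A minimal sketch, assuming the index is not wanted in the CSV; the small DataFrame here is built by hand from the first rows of the result, purely so the example runs without network access.

```python
import pandas as pd

# Hand-built stand-in for pd.read_html(url)[0], using rows from the result above.
df = pd.DataFrame(
    {
        "Rank": [1, 2, 3],
        "School": ["Florida St.", "Duke", "Arkansas"],
        "Conference": ["ACC", "ACC", "SEC"],
        "Record": ["22-1-2", "16-4-1", "19-4-1"],
    }
)

# With no path given, to_csv returns the CSV text instead of writing a file.
with_index = df.to_csv()
without_index = df.to_csv(index=False)

print(with_index.splitlines()[0])     # header begins with an empty index column
print(without_index.splitlines()[0])  # header begins with "Rank"
```

Applied to the answer's code, this would amount to `df.to_csv("w_soccer_rpi.csv", index=False)`.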
