Web scraping table data using Beautiful Soup

Question

I am trying to learn web scraping in Python with Beautiful Soup for a project. As an exercise, I want to scrape the name of each active Kansas City Chiefs player together with the college he attended. This is the URL I use: https://www.chiefs.com/team/players-roster/

When I run the script, I get the error "IndexError: list index out of range".

I don't know whether the classes I set are wrong. Help would be appreciated.

import requests
from bs4 import BeautifulSoup

url = "https://www.chiefs.com/team/players-roster/"

r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')

roster_table = soup.find('table', class_ = 'd3-o-table d3-o-table--row-striping d3-o-table--detailed d3-o-table--sortable {sortlist: [[0,0]]}')

for person in roster_table.find_all('tbody'):
    rows = person.find_all('tr')
    for row in rows:
        player_name = row.find('td', class_='sorter-lastname selected"')
        player_university = row.find_all('td', class_='sorter-lastname selected')[7].text
        print(player_name,player_university)

Answer 1

I couldn't find a td with the class sorter-lastname selected in the page source. You basically need the last td in each row, so this would do:

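for person in roster_table.find_all('tbody'):
    rows = person.find_all('tr')
    for row in rows:
        cells = row.find_all('td')
        if not cells:  # skip rows without any td to avoid an IndexError
            continue
        player_name = cells[0].get_text(strip=True)         # first cell, assumed to hold the name
        player_university = cells[-1].get_text(strip=True)  # last cell holds the college
        print(player_name, player_university)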

PS: scraping tables is extremely easy with pandas:

import pandas as pd

df = pd.read_html('https://www.chiefs.com/team/players-roster/')
df[0].to_csv('output.csv')

It may take a bit longer, but the output is impressive. For example, print(df[0]) gives:

               Player     # Pos    HT   WT  Age Exp            College
0       Josh Pederson   NaN  TE   6-5  235   24   R   Louisiana-Monroe
1   Brandin Dandridge   NaN  WR  5-10  180   25   R   Missouri Western
2       Justin Watson   NaN  WR   6-3  215   25   4       Pennsylvania
3    Jonathan Woodard   NaN  DE   6-5  271   28   3   Central Arkansas
4     Andrew Billings   NaN  DT   6-1  311   26   5             Baylor
..                ...   ...  ..   ...  ...  ...  ..                ...
84   James Winchester  41.0  LS   6-3  242   32   7           Oklahoma
85       Travis Kelce  87.0  TE   6-5  256   32   9         Cincinnati
86        Marcus Kemp  85.0  WR   6-4  208   26   4             Hawaii
87        Chris Jones  95.0  DT   6-6  298   27   6  Mississippi State
88    Harrison Butker   7.0   K   6-4  196   26   5       Georgia Tech

[89 rows x 8 columns]
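Since the question only asks for the player name and the college, the frame can be narrowed to those two columns. The following is a minimal sketch, not part of the original answer: a small hand-built DataFrame with three rows copied from the output above stands in for df[0], so it runs without fetching the page. The variable names roster, name_college and csv_text are made up for illustration.

```python
import pandas as pd

# Three rows copied from the printed output stand in for df[0],
# so this example runs without any network access.
roster = pd.DataFrame({
    "Player": ["Travis Kelce", "Chris Jones", "Harrison Butker"],
    "Pos": ["TE", "DT", "K"],
    "College": ["Cincinnati", "Mississippi State", "Georgia Tech"],
})

# Keep only the two columns the question asks for.
name_college = roster[["Player", "College"]]

# to_csv returns the CSV as a string when no path is given.
csv_text = name_college.to_csv(index=False)
print(csv_text)
```

With the real data, the same column selection would presumably be applied to df[0] before writing the CSV file.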

Answer 2

TL;DR: There are two issues to solve: (1) indexing, (2) HTML element queries.

Indexing

The Python index operator is written as a pair of square brackets: []. The syntax requires an index, such as an integer, inside the brackets.

Example: [7] applies indexing to the preceding list (all td elements found) to get the element at index 7. Python indices are 0-based, so the first element has index 0.

In your statement, you collect all cells found as <td> elements with the given classes and try to get the 8th one by indexing with [7]:

row.find_all('td', class_='sorter-lastname selected')[7]
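To make the 0-based rule concrete, here is a small sketch in plain Python. A list with made-up values stands in for the result of find_all; the names cells, first, eighth and message are chosen only for this illustration.

```python
# A plain list stands in for the result of row.find_all('td', ...).
# The values are made up for illustration.
cells = ["Travis Kelce", "87", "TE"]

first = cells[0]   # indices are 0-based: index 0 is the first element

eighth = None
message = ""
try:
    eighth = cells[7]          # only 3 elements exist, so index 7 is out of range
except IndexError as error:
    message = str(error)       # same message as in the question's traceback

print(first, eighth, message)
```

The list has three elements, so the valid non-negative indices are 0, 1 and 2; asking for index 7 raises the same IndexError the question reports.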

How to avoid index errors?

Are you sure that any td elements are found in the row? And if some are found, can we guarantee that there are always at least 8?

In this case, there were apparently fewer than 8 elements.

That's why Python raises an IndexError, e.g. at line 15 of the given script:

Traceback (most recent call last):
  File "<stdin>", line 15, in <module>
    player_university = row.find_all('td', class_='sorter-lastname selected')[7].text
IndexError: list index out of range

It is better to test the length before indexing:

import requests
from bs4 import BeautifulSoup

url = "https://www.chiefs.com/team/players-roster/"

html = requests.get(url).content
soup = BeautifulSoup(html, 'html.parser')

roster_table = soup.find('table', class_ = 'd3-o-table d3-o-table--row-striping d3-o-table--detailed d3-o-table--sortable {sortlist: [[0,0]]}')

for person in roster_table.find_all('tbody'):
    rows = person.find_all('tr')
    for row in rows:
        print(f"person row: {row}")  # debug-print helps to fix element-query
        player_name = row.find('td', class_='sorter-lastname selected"')
        cells = row.find_all('td', class_='sorter-lastname selected')
        player_university = None   # define a default to avoid NameError
        if len(cells) > 7:   # test on minimum length of 8 for index 7
            player_university = cells[7].text
        print(player_name, player_university)
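The length test can also be wrapped in a small helper so the guard is not repeated for every cell. This is only a sketch: cell_text is a hypothetical helper, not a Beautiful Soup function, and the HTML below is made-up sample markup, not the real roster page.

```python
from bs4 import BeautifulSoup

# Made-up rows for illustration; the real roster markup may differ.
html = """
<table><tbody>
<tr><td>Player A</td><td>TE</td><td>College A</td></tr>
<tr><td>Player B</td></tr>
</tbody></table>
"""

def cell_text(cells, index, default=None):
    """Return the stripped text of cells[index], or default if the row is too short."""
    if -len(cells) <= index < len(cells):
        return cells[index].get_text(strip=True)
    return default

soup = BeautifulSoup(html, "html.parser")
results = []
for row in soup.find_all("tr"):
    cells = row.find_all("td")
    # The second row has only one cell, so index 2 falls back to None.
    results.append((cell_text(cells, 0), cell_text(cells, 2)))

print(results)
```

Returning a default instead of raising keeps the loop running over short or empty rows, at the cost of having to handle None later.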

Element queries

Once the indexing was fixed, the queries still returned empty results: both values came back as None.

We need to debug (which is why I added the print inside the loop) and adjust the queries:

(1) For the university name: if you follow RJ's answer and choose the last cell without any class condition, a negative index such as -1 counts from the end, so here it selects the last cell. The row must then contain at least one cell.

(2) For the player name: it appears to be in the first cell (which also has a CSS class for sorting), nested either in a link title <a.. title="Player Name"> or, in a following sibling, as the inner text of span > a.
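The two adjustments described above can be sketched together as follows. The HTML is assumed, simplified markup written for this example; the real page may nest the elements differently, so treat the structure as a placeholder and not as the site's actual source.

```python
from bs4 import BeautifulSoup

# Assumed, simplified markup: the real page may nest these elements differently.
html = """
<table><tbody>
<tr>
  <td class="sorter-lastname"><span><a href="#" title="Player A">Player A</a></span></td>
  <td>87</td>
  <td>College A</td>
</tr>
<tr></tr>
</tbody></table>
"""

soup = BeautifulSoup(html, "html.parser")
roster = []
for row in soup.find_all("tr"):
    cells = row.find_all("td")
    if not cells:  # skip rows without cells so cells[-1] cannot fail
        continue
    link = cells[0].find("a")  # the first cell is assumed to hold the name link
    name = link.get("title") if link else cells[0].get_text(strip=True)
    college = cells[-1].get_text(strip=True)  # -1 selects the last cell
    roster.append((name, college))

print(roster)
```

The empty second row shows why the check for at least one cell matters: without it, cells[-1] would raise an IndexError again.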

CSS selectors

You can use CSS selectors for that with bs4's select or select_one functions. You can then select a path like td >? >? > a and get the title.

Note: the ? placeholders are left as a challenging exercise for you.
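As a rough illustration of select and select_one, the sketch below runs on assumed markup and deliberately uses a descendant selector (td:first-child a[title]) in place of the exact child path, so it does not fill in the ? placeholders. The div and span nesting is invented for the example.

```python
from bs4 import BeautifulSoup

# Assumed markup for illustration only; inspect the real page for the actual nesting.
html = """
<table><tbody>
<tr><td><div><span><a href="#" title="Player A">Player A</a></span></div></td><td>College A</td></tr>
<tr><td><div><span><a href="#" title="Player B">Player B</a></span></div></td><td>College B</td></tr>
</tbody></table>
"""

soup = BeautifulSoup(html, "html.parser")
names = []
for row in soup.select("tbody > tr"):
    # Descendant selector: any <a> with a title attribute inside the first cell.
    link = row.select_one("td:first-child a[title]")
    if link is not None:  # select_one returns None when nothing matches
        names.append(link["title"])

print(names)
```

A descendant selector is more tolerant of changes in the nesting than a strict child path, though it may match more than intended on a more complex page.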

Tip: most browsers have an inspector. Right-click the element (e.g. the player name) and choose "Inspect element"; an HTML source view opens with the element selected. Right-click it again to "Copy" the element as a "CSS selector".
