
Web scraping table data using Beautiful Soup

I am trying to learn web scraping in Python with Beautiful Soup for a project by doing the following: scraping the names of the Kansas City Chiefs' active players along with the college each attended. This is the URL used: https://www.chiefs.com/team/players-roster/

When I run the script, I get an error saying "IndexError: list index out of range".

I don't know if the classes I set are wrong. Help would be appreciated.

import requests
from bs4 import BeautifulSoup

url = "https://www.chiefs.com/team/players-roster/"

r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')

roster_table = soup.find('table', class_ = 'd3-o-table d3-o-table--row-striping d3-o-table--detailed d3-o-table--sortable {sortlist: [[0,0]]}')

for person in roster_table.find_all('tbody'):
    rows = person.find_all('tr')
    for row in rows:
        player_name = row.find('td', class_='sorter-lastname selected"')
        player_university = row.find_all('td', class_='sorter-lastname selected')[7].text
        print(player_name,player_university)

I couldn't find a td with class sorter-lastname selected in the source code. You basically need the last td in each row, so this would do:

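for person in roster_table.find_all('tbody'):
    rows = person.find_all('tr')
    for row in rows:
        cells = row.find_all('td')  # all cells, without any class filter
        player_name = cells[0].get_text(strip=True)         # assumes the name is in the first cell
        player_university = cells[-1].get_text(strip=True)  # last cell holds the college
        print(player_name, player_university)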

PS. Scraping tables is extremely easy in pandas:

import pandas as pd

# read_html returns a list of DataFrames, one per table found on the page
df = pd.read_html('https://www.chiefs.com/team/players-roster/')
df[0].to_csv('output.csv')

It may take a bit longer, but the output is impressive; for example, print(df[0]) gives:

               Player     # Pos    HT   WT  Age Exp            College
0       Josh Pederson   NaN  TE   6-5  235   24   R   Louisiana-Monroe
1   Brandin Dandridge   NaN  WR  5-10  180   25   R   Missouri Western
2       Justin Watson   NaN  WR   6-3  215   25   4       Pennsylvania
3    Jonathan Woodard   NaN  DE   6-5  271   28   3   Central Arkansas
4     Andrew Billings   NaN  DT   6-1  311   26   5             Baylor
..                ...   ...  ..   ...  ...  ...  ..                ...
84   James Winchester  41.0  LS   6-3  242   32   7           Oklahoma
85       Travis Kelce  87.0  TE   6-5  256   32   9         Cincinnati
86        Marcus Kemp  85.0  WR   6-4  208   26   4             Hawaii
87        Chris Jones  95.0  DT   6-6  298   27   6  Mississippi State
88    Harrison Butker   7.0   K   6-4  196   26   5       Georgia Tech

[89 rows x 8 columns]

TL;DR: Two issues to solve: (1) indexing, (2) HTML element-queries

Indexing

The Python index operator is written as opening and closing square brackets: []. For a list, the syntax requires you to put an integer (or a slice) inside the brackets.

Example: [7] applies indexing to the preceding list (all td elements found) to get the element with index 7. In Python, indices are 0-based, so the first element has index 0.
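To make the 0-based counting concrete, here is a minimal sketch. The list below is a hypothetical stand-in for the cells of one row (its values simply mirror the column headers from the pandas output), not data read from the page.

```python
# Hypothetical list standing in for the cells found in one table row.
cells = ["Player", "#", "Pos", "HT", "WT", "Age", "Exp", "College"]

first = cells[0]    # indices start at 0, so this is the first element
eighth = cells[7]   # index 7 is the 8th element

short_row = cells[:3]  # a row with only three cells
try:
    short_row[7]       # there is no 8th element here
    error_message = None
except IndexError as exc:
    error_message = str(exc)

print(first, eighth, error_message)
```

The last lookup fails for the same reason as the script in the question: the list is shorter than the index asks for.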

In your statement, you collect all cells found as <td> HTML elements with the specified classes into a list and want to get the 8th element by indexing with [7]:

row.find_all('td', class_='sorter-lastname selected')[7]

How to avoid index errors?

Are you sure any td elements are found in the row? If some are found, can we guarantee that there are always at least 8?

In this case, there were apparently fewer than 8 elements.

That's why Python raises an IndexError, e.g. in line 15 of the given script:

Traceback (most recent call last):
  File "<stdin>", line 15, in <module>
    player_university = row.find_all('td', class_='sorter-lastname selected')[7].text
IndexError: list index out of range

Better to test the length before indexing:

import requests
from bs4 import BeautifulSoup

url = "https://www.chiefs.com/team/players-roster/"

html = requests.get(url).content
soup = BeautifulSoup(html, 'html.parser')

roster_table = soup.find('table', class_ = 'd3-o-table d3-o-table--row-striping d3-o-table--detailed d3-o-table--sortable {sortlist: [[0,0]]}')

for person in roster_table.find_all('tbody'):
    rows = person.find_all('tr')
    for row in rows:
        print(f"person row: {row}")  # debug-print helps to fix element-query
        player_name = row.find('td', class_='sorter-lastname selected"')
        cells = row.find_all('td', class_='sorter-lastname selected')
        player_university = None   # define a default to avoid NameError
        if len(cells) > 7:   # test on minimum length of 8 for index 7
            player_university = cells[7].text
        print(player_name, player_university)
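As an alternative to the explicit length test, the lookup could be wrapped in a small helper that catches the IndexError. This is only a sketch: get_or_default is a hypothetical name, not part of Python or bs4, and the sample row is made up.

```python
def get_or_default(items, index, default=None):
    """Return items[index], or default when the index is out of range."""
    try:
        return items[index]
    except IndexError:
        return default


cells = ["Travis Kelce", "87", "TE"]  # hypothetical short row

found = get_or_default(cells, 0)              # index exists
missing = get_or_default(cells, 7)            # index 7 does not exist
fallback = get_or_default(cells, 7, "n/a")    # same, with an explicit default

print(found, missing, fallback)
```

With real bs4 cells you would still need to check for the default before calling .text on the result, since None has no such attribute.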

Element-queries

Once the index was fixed, the queries returned empty results: None, None.

We need to debug (thus I added the print inside the loop) and adjust the queries:

(1) for the university-name: If you follow RJ's answer and choose the last cell without any class condition, then a negative index like -1 counts from the end, so here it picks the last element. The number of cells must be at least 1 for this to work.

(2) for the player-name: It appears to be in the first cell (which also carries a CSS class for sorting), nested either in a link title <a.. title="Player Name"> or in a following sibling as the inner text of span > a.
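The two adjusted queries can be tried offline on a small HTML string. The markup below is hypothetical and heavily simplified (the real roster table has more columns and nested elements); it only illustrates taking the first and the last cell of each row, and skipping rows without cells.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup; not the real structure of the roster page.
html = """
<table>
  <tbody>
    <tr><td>Travis Kelce</td><td>TE</td><td>Cincinnati</td></tr>
    <tr><td>Harrison Butker</td><td>K</td><td>Georgia Tech</td></tr>
    <tr></tr>
  </tbody>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
results = []
for row in soup.find_all("tr"):
    cells = row.find_all("td")
    if not cells:  # [-1] raises IndexError on an empty list, too
        continue
    name = cells[0].get_text(strip=True)      # first cell
    college = cells[-1].get_text(strip=True)  # last cell
    results.append((name, college))

print(results)
```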

CSS selectors

You may use CSS selectors for that with bs4's select or select_one functions. Then you can select a path like td > ? > ? > a and get the title.

(Note: the ? placeholders are left as a challenging exercise for you.)
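To show how select_one behaves with such a path, here is a sketch on made-up markup. The div and span nesting and the names are assumptions for illustration only; they are not the answer to the exercise for the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; the nesting here is invented, not copied from the roster page.
html = """
<table>
  <tr>
    <td><div><span><a href="/p/1" title="Player One">Player One</a></span></div></td>
    <td>Some College</td>
  </tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# ">" matches direct children only, so every level of nesting must be named.
link = soup.select_one("td > div > span > a")

# select_one returns None when nothing matches, so guard before reading attributes.
name = link.get("title") if link is not None else None
no_match = soup.select_one("td > p > a")

print(name, no_match)
```

select returns a list of all matches, while select_one returns only the first match or None.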

Tip: most browsers have an inspector. Right-click the element (e.g. the player name) and choose "Inspect element"; an HTML source view opens with the element selected. Right-click again to "Copy" the element as "CSS selector".

Further Reading

About indexing, and the magic of negative numbers like [-1]:

.. a bit further, about slicing:

Research on Beautiful Soup here:


 