Web scraping table data using Beautiful Soup
I am trying to learn web scraping in Python for a project using Beautiful Soup by doing the following: scraping the name of each active Kansas City Chiefs player along with the college attended. This is the URL used: https://www.chiefs.com/team/players-roster/
After running the script, I get an error saying "IndexError: list index out of range".
I don't know if the classes I set are wrong. Help would be appreciated.
import requests
from bs4 import BeautifulSoup

url = "https://www.chiefs.com/team/players-roster/"
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')
roster_table = soup.find('table', class_ = 'd3-o-table d3-o-table--row-striping d3-o-table--detailed d3-o-table--sortable {sortlist: [[0,0]]}')
for person in roster_table.find_all('tbody'):
    rows = person.find_all('tr')
    for row in rows:
        player_name = row.find('td', class_='sorter-lastname selected"')
        player_university = row.find_all('td', class_='sorter-lastname selected')[7].text
        print(player_name,player_university)
I couldn't find a td with class sorter-lastname selected in the source code. You basically need the last td in each row, so this would do:
for person in roster_table.find_all('tbody'):
    rows = person.find_all('tr')
    for row in rows:
        player_name = row.find('td', class_='sorter-lastname selected"')
        player_university = row.find_all('td')[-1].text
PS. Scraping tables is extremely easy in pandas:
import pandas as pd
# read_html returns a list of DataFrames, one per <table> found on the page
df = pd.read_html('https://www.chiefs.com/team/players-roster/')
df[0].to_csv('output.csv')
It may take a bit longer, but the output is impressive. For example, print(df[0]) gives:
                Player     #  Pos    HT   WT  Age Exp            College
0        Josh Pederson   NaN   TE   6-5  235   24   R   Louisiana-Monroe
1    Brandin Dandridge   NaN   WR  5-10  180   25   R   Missouri Western
2        Justin Watson   NaN   WR   6-3  215   25   4       Pennsylvania
3     Jonathan Woodard   NaN   DE   6-5  271   28   3   Central Arkansas
4      Andrew Billings   NaN   DT   6-1  311   26   5             Baylor
..                 ...   ...  ...   ...  ...  ...  ..                ...
84    James Winchester  41.0   LS   6-3  242   32   7           Oklahoma
85        Travis Kelce  87.0   TE   6-5  256   32   9         Cincinnati
86         Marcus Kemp  85.0   WR   6-4  208   26   4             Hawaii
87         Chris Jones  95.0   DT   6-6  298   27   6  Mississippi State
88     Harrison Butker   7.0    K   6-4  196   26   5       Georgia Tech

[89 rows x 8 columns]
TL;DR: Two issues to solve: (1) indexing, (2) HTML element queries
The Python index operator is written as a pair of square brackets: []. For a list, the syntax requires you to put an integer index inside the brackets.
Example: [7] applies indexing to the preceding sequence (all found td elements) to get the element with index 7. In Python, indices are 0-based, so the first element has index 0.
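As a minimal sketch of the point above, here is plain-Python indexing on a list. The cell values are made up for illustration and merely stand in for the td elements of one row; the empty list stands in for a query that matched nothing.

```python
# Hypothetical cell texts standing in for the <td> elements of one roster row.
cells = ["Travis Kelce", "87", "TE", "6-5", "256", "32", "9", "Cincinnati"]

first = cells[0]   # indices are 0-based, so index 0 is the first element
eighth = cells[7]  # index 7 is therefore the 8th element

# An empty list, as you get when no element matches the query.
no_match = []
try:
    no_match[7]
    error_message = None
except IndexError as exc:
    # Indexing past the end of a list raises IndexError.
    error_message = str(exc)

print(first, eighth, error_message)
```

The same IndexError appears whenever the list is shorter than the index requires, not only when it is empty.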
In your statement, you take all found cells (the <td> HTML elements with the specified classes) as a sequence and want to get the 8th element by indexing with [7].
row.find_all('td', class_='sorter-lastname selected')[7]
Are you sure any td elements are found in the row? If some are found, can we guarantee that there are always at least 8?
In this case, there were apparently fewer than 8 elements.
That's why Python raises an IndexError, e.g. at line 15 of the given script:
Traceback (most recent call last):
  File "<stdin>", line 15, in <module>
    player_university = row.find_all('td', class_='sorter-lastname selected')[7].text
IndexError: list index out of range
It is better to test the length before indexing:
import requests
from bs4 import BeautifulSoup

url = "https://www.chiefs.com/team/players-roster/"
html = requests.get(url).content
soup = BeautifulSoup(html, 'html.parser')
roster_table = soup.find('table', class_ = 'd3-o-table d3-o-table--row-striping d3-o-table--detailed d3-o-table--sortable {sortlist: [[0,0]]}')
for person in roster_table.find_all('tbody'):
    rows = person.find_all('tr')
    for row in rows:
        print(f"person row: {row}")  # debug-print helps to fix element-query
        player_name = row.find('td', class_='sorter-lastname selected"')
        cells = row.find_all('td', class_='sorter-lastname selected')
        player_university = None  # define a default to avoid NameError
        if len(cells) > 7:  # test on minimum length of 8 for index 7
            player_university = cells[7].text
        print(player_name, player_university)
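The same guard could also be wrapped in a small reusable helper. This is only a sketch: item_or_default is a hypothetical name chosen for illustration, not part of Beautiful Soup or the standard library, and the sample lists are made up.

```python
def item_or_default(items, index, default=None):
    """Return items[index], or default when the index is out of range."""
    try:
        return items[index]
    except IndexError:
        return default


# Made-up data: one row with 8 cells and one row where nothing matched.
full_row = ["Travis Kelce", "87", "TE", "6-5", "256", "32", "9", "Cincinnati"]
empty_row = []

found = item_or_default(full_row, 7)             # index 7 exists
missing = item_or_default(empty_row, 7)          # falls back to None
fallback = item_or_default(empty_row, 7, "n/a")  # falls back to a custom default

print(found, missing, fallback)
```

Catching IndexError instead of comparing lengths also covers negative indices such as -1 without a separate check.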
Once the index was guarded, both queries returned empty results: None, None.
We need to debug (which is why I added the print inside the loop) and adjust the queries:
(1) For the university name: if you follow RJ's answer and choose the last cell without any class condition, a negative index like -1 counts from the end, so here it selects the last cell. This requires the row to contain at least one cell.
(2) For the player name: it appears to be in the first cell (which also has a CSS class for sorting), nested either in a link title, <a.. title="Player Name">, or in a following sibling as the inner text of span > a.
You may use CSS selectors for that with bs4's select or select_one functions. Then you can select a path like td >? >? > a and get the title.
(Note: the ? placeholders are left as a challenging exercise for you.)
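To show how select, select_one and get fit together without giving the exercise away, here is a sketch that runs on a small hand-written HTML snippet. The markup, including the div and span nesting and the sample players, is an assumption made up for illustration and is not copied from the real roster page, so the selector path for the actual site still has to be found with the inspector.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: NOT the real structure of the roster page.
html = """
<table>
  <tbody>
    <tr>
      <td class="sorter-lastname">
        <div><span><a href="#" title="Travis Kelce">Travis Kelce</a></span></div>
      </td>
      <td>TE</td>
      <td>Cincinnati</td>
    </tr>
    <tr>
      <td class="sorter-lastname">
        <div><span><a href="#" title="Chris Jones">Chris Jones</a></span></div>
      </td>
      <td>DT</td>
      <td>Mississippi State</td>
    </tr>
  </tbody>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

players = []
for row in soup.select("tbody > tr"):
    # select_one returns the first match, or None if nothing matches.
    link = row.select_one("td > div > span > a")
    cells = row.find_all("td")
    if link is None or not cells:
        continue  # skip rows that do not have the expected shape
    # The name comes from the link's title attribute, the college from the last cell.
    players.append((link.get("title"), cells[-1].get_text(strip=True)))

for name, college in players:
    print(name, college)
```

Checking for None and for an empty cell list before using the results avoids both an AttributeError and the IndexError from the question.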
Tip: most browsers have an inspector. Right-click on the element (e.g. the player name) and choose "Inspect element"; an HTML source view opens with the element selected. Right-click again to "Copy" the element as "CSS selector".
About indexing, and the magic of negative numbers like [-1]:
.. a bit further, about slicing:
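A short sketch of both ideas on a plain Python list follows. The values are made up and stand in for the cell texts of one row.

```python
# Hypothetical cell texts of one roster row.
cells = ["Travis Kelce", "87", "TE", "6-5", "256", "32", "9", "Cincinnati"]

last = cells[-1]            # negative indices count from the end
second_to_last = cells[-2]

first_three = cells[:3]     # slice from the start up to (not including) index 3
last_two = cells[-2:]       # slice from the second-to-last element to the end

# Unlike indexing, slicing never raises IndexError; out-of-range bounds are clipped.
clipped = cells[5:100]
empty_tail = [][-1:]        # slicing an empty list just gives an empty list

print(last, second_to_last, first_three, last_two, clipped, empty_tail)
```

Because a slice of an empty list is simply empty, something like cells[-1:] can serve as a forgiving alternative to cells[-1] when a row may contain no cells.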
Research on Beautiful Soup here: