How do I separate this 1 item list into multiple lists?

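from urllib.request import urlopen
from bs4 import BeautifulSoup
import numpy as np

# Fetch and parse the page (the parser choice here is an assumption)
url = 'https://www.pro-football-reference.com/years/2021/passing.htm'
soup = BeautifulSoup(urlopen(url), 'html.parser')

# One inner list per <tbody>, holding the text of every <td> inside it
table_body = soup.findAll('tbody', class_ = lambda table_rows: table_rows != "thead")
table_data = [[td.getText() for td in table_body[i].findAll('td')]
                for i in range(len(table_body))]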

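I am working on a project that scrapes data from https://www.pro-football-reference.com/years/2021/passing.htm. My code for scraping the table's headers works fine, but I am having a lot of trouble formatting the table's body so that each player's stats end up in their own row. When I run print(table_data), the result is a single-item list that prints the following: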

[['Tom Brady*', 'TAM', '44', 'QB', '17', '17', '13-4-0', '485', '719', '67.5', '5316', '43', '6', '12', '1.7', '269', '62', '7.4', '7.8', '11.0', '312.7', '102.1', '68.1', '22', '144', '3', '6.98', '7.41', '3', '5', 'Justin Herbert*', 'LAC', '23', 'QB', '17', '17', '9-8-0', '443', '672', '65.9', '5014', '38', '5.7', '15', '2.2', '256', '72', '7.5', '7.6', '11.3', '294.9', '97.7', '65.6', '31', '214', '4.4', '6.83', '6.95', '5', '5', 'Matthew Stafford', 'LAR', '33', 'QB', '17', '17', '12-5-0', '404', '601', '67.2', '4886', '41', '6.8', '17', '2.8', '233', '79', '8.1', '8.2', '12.1', '287.4', '102.9', '63.8', '30', '243', '4.8', '7.36', '7.45', '3', '4',....]]

How do I separate this one-item list into multiple lists so that I get my desired output:

[
['Tom Brady*', 'TAM', '44', 'QB', '17', '17', '13-4-0', '485', '719', '67.5', '5316', '43', '6', '12', '1.7', '269', '62', '7.4', '7.8', '11.0', '312.7', '102.1', '68.1', '22', '144', '3', '6.98', '7.41', '3', '5']
['Justin Herbert*', 'LAC', '23', 'QB', '17', '17', '9-8-0', '443', '672', '65.9', '5014', '38', '5.7', '15', '2.2', '256', '72', '7.5', '7.6', '11.3', '294.9', '97.7', '65.6', '31', '214', '4.4', '6.83', '6.95', '5', '5']
['Matthew Stafford', 'LAR', '33', 'QB', '17', '17', '12-5-0', '404', '601', '67.2', '4886', '41', '6.8', '17', '2.8', '233', '79', '8.1', '8.2', '12.1', '287.4', '102.9', '63.8', '30', '243', '4.8', '7.36', '7.45', '3', '4']
['Patrick Mahomes'...]
['Derek Carr'...]
]

Iterate over the rows of the table and, for each row, iterate over its <td> elements to get their text:

[[e.text for e in row.select('td')] for row in soup.select('tbody tr')]

Output:

[['Tom Brady*', 'TAM', '44', 'QB', '17', '17', '13-4-0', '485', '719', '67.5', '5316', '43', '6', '12', '1.7', '269', '62', '7.4', '7.8', '11.0', '312.7', '102.1', '68.1', '22', '144', '3', '6.98', '7.41', '3', '5'],
 ['Justin Herbert*', 'LAC', '23', 'QB', '17', '17', '9-8-0', '443', '672', '65.9', '5014', '38', '5.7', '15', '2.2', '256', '72', '7.5', '7.6', '11.3', '294.9', '97.7', '65.6', '31', '214', '4.4', '6.83', '6.95', '5', '5'],
 ['Matthew Stafford', 'LAR', '33', 'QB', '17', '17', '12-5-0', '404', '601', '67.2', '4886', '41', '6.8', '17', '2.8', '233', '79', '8.1', '8.2', '12.1', '287.4', '102.9', '63.8', '30', '243', '4.8', '7.36', '7.45', '3', '4'],
 ['Patrick Mahomes*', 'KAN', '26', 'QB', '17', '17', '12-5-0', '436', '658', '66.3', '4839', '37', '5.6', '13', '2', '260', '75', '7.4', '7.6', '11.1', '284.6', '98.5', '62.2', '28', '146', '4.1', '6.84', '7.07', '3', '3'],
 ['Derek Carr', 'LVR', '30', 'QB', '17', '17', '10-7-0', '428', '626', '68.4', '4804', '23', '3.7', '14', '2.2', '217', '61', '7.7', '7.4', '11.2', '282.6', '94.0', '52.4', '40', '241', '6', '6.85', '6.60', '3', '6'],
 ['Joe Burrow', 'CIN', '25', 'QB', '16', '16', '10-6-0', '366', '520', '70.4', '4611', '34', '6.5', '14', '2.7', '202', '82', '8.9', '9.0', '12.6', '288.2', '108.3', '54.3', '51', '370', '8.9', '7.43', '7.51', '2', '3'],
 ['Dak Prescott', 'DAL', '28', 'QB', '16', '16', '11-5-0', '410', '596', '68.8', '4449', '37', '6.2', '10', '1.7', '227', '51', '7.5', '8.0', '10.9', '278.1', '104.2', '54.6', '30', '144', '4.8', '6.88', '7.34', '1', '2'],
 ['Josh Allen', 'BUF', '25', 'QB', '17', '17', '11-6-0', '409', '646', '63.3', '4407', '36', '5.6', '15', '2.3', '234', '61', '6.8', '6.9', '10.8', '259.2', '92.2', '60.7', '26', '164', '3.9', '6.31', '6.38', '', ''],
 ['Kirk Cousins*', 'MIN', '33', 'QB', '16', '16', '8-8-0', '372', '561', '66.3', '4221', '33', '5.9', '7', '1.2', '192', '64', '7.5', '8.1', '11.3', '263.8', '103.1', '52.3', '28', '197', '4.8', '6.83', '7.42', '3', '4'],
 ['Aaron Rodgers*+', 'GNB', '38', 'QB', '16', '16', '13-3-0', '366', '531', '68.9', '4115', '37', '7', '4', '0.8', '213', '75', '7.7', '8.8', '11.2', '257.2', '111.9', '69.1', '30', '188', '5.3', '7.00', '8.00', '1', '2'],
 ['Matt Ryan', 'ATL', '36', 'QB', '17', '17', '7-10-0', '375', '560', '67', '3968', '20', '3.6', '12', '2.1', '195', '64', '7.1', '6.8', '10.6', '233.4', '90.4', '46.1', '40', '274', '6.7', '6.16', '5.92', '3', '4'],
 ['Jimmy Garoppolo', 'SFO', '30', 'QB', '15', '15', '9-6-0', '301', '441', '68.3', '3810', '20', '4.5', '12', '2.7', '172', '83', '8.6', '8.3', '12.7', '254.0', '98.7', '53.3', '29', '201', '6.2', '7.68', '7.38', '3', '3'],
 ...]
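
To make the difference concrete without hitting the network, here is a minimal sketch on a made-up HTML fragment. The fragment, the three-column layout, and the helper name chunk are assumptions for illustration only; the real page has many more columns. It shows that collecting <td> cells per <tbody> gives one flat list, that collecting them per <tr> gives one list per player, and that a flat list can also be cut into fixed-width pieces if every row is known to have the same number of cells.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the real page: a single <tbody> with two data
# rows and one repeated header row that contains <th> cells only.
HTML = """
<table>
  <thead><tr><th>Player</th><th>Tm</th><th>Age</th></tr></thead>
  <tbody>
    <tr><td>Tom Brady*</td><td>TAM</td><td>44</td></tr>
    <tr class="thead"><th>Player</th><th>Tm</th><th>Age</th></tr>
    <tr><td>Justin Herbert*</td><td>LAC</td><td>23</td></tr>
  </tbody>
</table>
"""

soup = BeautifulSoup(HTML, "html.parser")

# Per <tbody>: one inner list per tbody, so a single flat list here.
flat = [
    [td.get_text() for td in tbody.find_all("td")]
    for tbody in soup.find_all("tbody")
]

# Per <tr>: one inner list per row; rows without any <td> are skipped.
rows = [
    [td.get_text() for td in tr.select("td")]
    for tr in soup.select("tbody tr")
    if tr.select("td")
]


def chunk(values, width):
    """Split a flat list into consecutive sublists of `width` items."""
    return [values[i:i + width] for i in range(0, len(values), width)]


# Only safe when every row has exactly `width` cells.
chunked = chunk(flat[0], 3)

print(flat)
print(rows)
print(chunked)
```

Iterating per row is usually the safer of the two, since chunking silently misaligns everything after the first row that has a missing or extra cell.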

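Just to point out an alternative: pandas.read_html() is a simple and common approach for this kind of task, and it does the HTML parsing for you under the hood (with lxml by default, falling back to BeautifulSoup with html5lib).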

Example
import pandas as pd

#read the first table from url into dataframe
df = pd.read_html('https://www.pro-football-reference.com/years/2021/passing.htm')[0]
#select only rows that are not subheaders
df[df['Rk'] != 'Rk'] 
Output
Rk  Player            Tm   Age  Pos  G   GS  QBrec   Cmp  Att  Cmp%  Yds   TD  TD%  Int  Int%  1D   Lng  Y/A  AY/A  Y/C   Y/G    Rate   QBR   Sk  Yds.1  Sk%  NY/A  ANY/A  4QC  GWD
1   Tom Brady*        TAM  44   QB   17  17  13-4-0  485  719  67.5  5316  43  6    12   1.7   269  62   7.4  7.8   11    312.7  102.1  68.1  22  144    3    6.98  7.41   3    5
2   Justin Herbert*   LAC  23   QB   17  17  9-8-0   443  672  65.9  5014  38  5.7  15   2.2   256  72   7.5  7.6   11.3  294.9  97.7   65.6  31  214    4.4  6.83  6.95   5    5
3   Matthew Stafford  LAR  33   QB   17  17  12-5-0  404  601  67.2  4886  41  6.8  17   2.8   233  79   8.1  8.2   12.1  287.4  102.9  63.8  30  243    4.8  7.36  7.45   3    4
4   Patrick Mahomes*  KAN  26   QB   17  17  12-5-0  436  658  66.3  4839  37  5.6  13   2     260  75   7.4  7.6   11.1  284.6  98.5   62.2  28  146    4.1  6.84  7.07   3    3
5   Derek Carr        LVR  30   QB   17  17  10-7-0  428  626  68.4  4804  23  3.7  14   2.2   217  61   7.7  7.4   11.2  282.6  94     52.4  40  241    6    6.85  6.6    3    6
6   Joe Burrow        CIN  25   QB   16  16  10-6-0  366  520  70.4  4611  34  6.5  14   2.7   202  82   8.9  9     12.6  288.2  108.3  54.3  51  370    8.9  7.43  7.51   2    3
7   Dak Prescott      DAL  28   QB   16  16  11-5-0  410  596  68.8  4449  37  6.2  10   1.7   227  51   7.5  8     10.9  278.1  104.2  54.6  30  144    4.8  6.88  7.34   1    2

...
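
A common follow-up, sketched here on a small hand-built frame rather than the live page: when repeated header rows sit among the data rows, the affected columns come back as text, so after dropping those rows you may want to convert the numeric columns. The frame below and the choice of columns are assumptions for illustration; only the filter on 'Rk' comes from the example above.

```python
import pandas as pd

# Hypothetical miniature of the scraped table: all values are strings and
# one repeated header row ("Rk", "Player", "Yds") sits among the data rows.
df = pd.DataFrame(
    {
        "Rk": ["1", "2", "Rk", "3"],
        "Player": ["Tom Brady*", "Justin Herbert*", "Player", "Matthew Stafford"],
        "Yds": ["5316", "5014", "Yds", "4886"],
    }
)

# Drop the repeated header rows, then renumber the index from 0.
clean = df[df["Rk"] != "Rk"].reset_index(drop=True)

# Convert the numeric columns from text to numbers.
for col in ["Rk", "Yds"]:
    clean[col] = pd.to_numeric(clean[col])

print(clean)
print(clean["Yds"].sum())
```

Note that df[df['Rk'] != 'Rk'] on its own returns a new frame without changing df, so the result has to be assigned to a name to be kept.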
