Extract data from a specific page using Python BeautifulSoup
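I am very new to Python and BeautifulSoup. I wrote the code below to request a page (https://www.fangraphs.com/depthcharts.aspx?position=Team), scrape the data in its table, and export it to a CSV file. I have been able to write code that extracts data from other tables on the site, but not from this particular one. It keeps returning: AttributeError: 'NoneType' object has no attribute 'find'. I have been racking my brain trying to figure out what I am doing wrong. Do I have the wrong "class" name? Again, I am very new and trying to teach myself. I have been learning by trial and error and by reverse-engineering other people's code. This one has me stumped. Any guidance?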
import requests
import csv
import datetime
from bs4 import BeautifulSoup

# static urls
season = datetime.datetime.now().year
URL = "https://www.fangraphs.com/depthcharts.aspx?position=Team".format(season=season)

# request the data
batting_html = requests.get(URL).text


def parse_array_from_fangraphs_html(input_html, out_file_name):
    """
    Take a HTML stats page from fangraphs and parse it out to a CSV file.
    """
    # parse input
    soup = BeautifulSoup(input_html, "lxml")
    table = soup.find("table", {"class": "tablesoreder, depth_chart tablesorter tablesorter-default"})

    # get headers
    headers_html = table.find("thead").find_all("th")
    headers = []
    for header in headers_html:
        headers.append(header.text)
    print(headers)

    # get rows
    rows = []
    rows_html = table.find("tbody").find_all("tr")
    for row in rows_html:
        row_data = []
        for cell in row.find_all("td"):
            row_data.append(cell.text)
        rows.append(row_data)

    # write to CSV file
    with open(out_file_name, "w") as out_file:
        writer = csv.writer(out_file)
        writer.writerow(headers)
        writer.writerows(rows)


parse_array_from_fangraphs_html(batting_html, 'Team War Totals.csv')
The traceback looks like this:
AttributeError                            Traceback (most recent call last)
<ipython-input-4-ee944e08f675> in <module>()
     41         writer.writerows(rows)
     42
---> 43 parse_array_from_fangraphs_html(batting_html, 'Team War Totals.csv')

<ipython-input-4-ee944e08f675> in parse_array_from_fangraphs_html(input_html, out_file_name)
     20
     21     # get headers
---> 22     headers_html = table.find("thead").find_all("th")
     23     headers = []
     24     for header in headers_html:

AttributeError: 'NoneType' object has no attribute 'find'
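So yes, the problem is in the statement

table = soup.find("table", {"class": "tablesoreder, depth_chart tablesorter tablesorter-default"})

It finds no matching element and returns None, which is why the later table.find("thead") call raises the AttributeError.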
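You can change it so that the class attribute is split on whitespace into separate class names, as other users suggested. However, it will then fail again, because the parsed table has no tbody.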
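As a rough illustration of the class matching described above, the sketch below runs BeautifulSoup against a small made-up table. The markup, its class attribute, and the cell values are assumptions for demonstration only, and the real page's HTML may differ. It uses the built-in "html.parser" so that it does not depend on lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the class attribute and cell values are invented
# for this sketch and are not copied from the real page.
sample_html = """
<table class="tablesoreder, depth_chart">
  <thead><tr><th>Team</th><th>WAR</th></tr></thead>
  <tr><td>Example A</td><td>1.5</td></tr>
</table>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# A string containing spaces is compared against the whole class attribute,
# so any difference in the attribute makes the lookup return None.
exact = soup.find(
    "table",
    {"class": "tablesoreder, depth_chart tablesorter tablesorter-default"},
)

# A list matches a table that carries at least one of the listed classes.
by_list = soup.find(
    "table",
    class_=["tablesoreder,", "depth_chart", "tablesorter", "tablesorter-default"],
)

# A single class name is also enough to match a multi-class element.
by_one_class = soup.find("table", class_="depth_chart")

print(exact)
print(by_list["class"])
print(by_one_class["class"])
```

Printing the class attribute of whatever table is found in the downloaded HTML is a quick way to check which class names the lookup can actually rely on.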
The fixed script looks like this:
import requests
import csv
import datetime
from bs4 import BeautifulSoup

# static urls
season = datetime.datetime.now().year
URL = "https://www.fangraphs.com/depthcharts.aspx?position=Team".format(season=season)

# request the data
batting_html = requests.get(URL).text


def parse_array_from_fangraphs_html(input_html, out_file_name):
    """
    Take a HTML stats page from fangraphs and parse it out to a CSV file.
    """
    # parse input
    soup = BeautifulSoup(input_html, "lxml")
    # a list of classes matches a table that carries any one of them
    table = soup.find("table", class_=["tablesoreder,", "depth_chart", "tablesorter", "tablesorter-default"])

    # get headers
    headers_html = table.find("thead").find_all("th")
    headers = []
    for header in headers_html:
        headers.append(header.text)
    print(headers)

    # get rows; the parsed table has no tbody, so search the table itself
    rows = []
    rows_html = table.find_all("tr")
    for row in rows_html:
        row_data = []
        for cell in row.find_all("td"):
            row_data.append(cell.text)
        if row_data:  # skip rows without td cells, such as a header row
            rows.append(row_data)

    # write to CSV file; newline="" is what the csv module docs recommend
    with open(out_file_name, "w", newline="") as out_file:
        writer = csv.writer(out_file)
        writer.writerow(headers)
        writer.writerows(rows)


parse_array_from_fangraphs_html(batting_html, 'Team War Totals.csv')
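The script above still assumes that the table lookup succeeds. As one possible safeguard, the sketch below wraps the lookup in a helper that raises a descriptive error instead of letting a None slip through to a later .find() call. The helper name find_table_or_raise and the sample markup are hypothetical and not part of the original answer.

```python
from bs4 import BeautifulSoup


def find_table_or_raise(html, class_name):
    """Return the first <table> carrying class_name, or raise ValueError."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", class_=class_name)
    if table is None:
        # Collect the classes that do appear on tables to aid debugging.
        present = sorted(
            {cls for t in soup.find_all("table") for cls in t.get("class", [])}
        )
        raise ValueError(
            f"no <table> with class {class_name!r}; "
            f"classes present on tables: {present}"
        )
    return table


# Made-up markup, used only to exercise the helper.
sample = '<table class="depth_chart"><tr><td>x</td></tr></table>'
table = find_table_or_raise(sample, "depth_chart")
print(table.name)
```

With a guard like this, a wrong class name surfaces as a message naming the classes that were actually found, rather than as an AttributeError on NoneType several lines later.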
Replace your table statement with:
table = soup.find("table", attrs={"class": ["tablesoreder,", "depth_chart", "tablesorter", "tablesorter-default"]})
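Likewise, once you have fixed this, your header code will not work as written, because the table's thead contains a tr, which in turn holds the header cells. So you have to replace that statement with: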
headers_html = table.find("thead").find("tr").find_all("th")
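Putting the two points together (a thead that wraps its cells in a tr, and a table that may have no tbody), a minimal sketch of the extraction could look like the following. The sample markup and values are invented for illustration, and "html.parser" is used only to keep the example free of extra dependencies.

```python
from bs4 import BeautifulSoup

# Made-up table with a thead > tr > th header and no tbody element.
sample_html = """
<table class="depth_chart">
  <thead><tr><th>Team</th><th>WAR</th></tr></thead>
  <tr><td>Example A</td><td>1.5</td></tr>
  <tr><td>Example B</td><td>0.5</td></tr>
</table>
"""

soup = BeautifulSoup(sample_html, "html.parser")
table = soup.find("table", class_="depth_chart")

# Header cells: thead, then its tr, then every th inside that row.
header_row = table.find("thead").find("tr")
headers = [th.get_text(strip=True) for th in header_row.find_all("th")]

# Data rows: use tbody when the parser produced one, else the table itself.
body = table.find("tbody") or table
rows = []
for tr in body.find_all("tr"):
    cells = [td.get_text(strip=True) for td in tr.find_all("td")]
    if cells:  # the header row has no td cells, so it is skipped here
        rows.append(cells)

print(headers)
print(rows)
```

Falling back to the table itself keeps the same code working whether or not a tbody is present in the parsed document.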