Not able to scrape HTML table from a website using Python script
I am trying to scrape the "Name" column from the table shown at this link and save it as a csv file.
I wrote a Python script like the one below:
from bs4 import BeautifulSoup
import requests
import csv

# Step 1: Sending a HTTP request to a URL
url = "https://myaccount.umn.edu/lookup?SET_INSTITUTION=UMNTC&type=name&CN=University+of+Minnesota&campus=a&role=any"
# Make a GET request to fetch the raw HTML content
html_content = requests.get(url).text

# Step 2: Parse the html content
soup = BeautifulSoup(html_content, "lxml")
# print(soup.prettify()) # print the parsed data of html

# Step 3: Analyze the HTML tag, where your content lives
# Create a data dictionary to store the data.
data = {}
# Get the table having the class wikitable
gdp_table = soup.find("table")
gdp_table_data = gdp_table.find_all("th")  # contains 2 rows

# Get all the headings of Lists
headings = []
for td in gdp_table_data[0].find_all("td"):
    # remove any newlines and extra spaces from left and right
    headings.append(td.b.text.replace('\n', ' ').strip())

# Get all the 3 tables contained in "gdp_table"
for table, heading in zip(gdp_table_data[1].find_all("table"), headings):
    # Get headers of table i.e., Rank, Country, GDP.
    t_headers = []
    for th in table.find_all("th"):
        # remove any newlines and extra spaces from left and right
        t_headers.append(th.text.replace('\n', ' ').strip())

    # Get all the rows of table
    table_data = []
    for tr in table.tbody.find_all("tr"):  # find all tr's from table's tbody
        t_row = {}
        # Each table row is stored in the form of
        # t_row = {'Rank': '', 'Country/Territory': '', 'GDP(US$million)': ''}
        # find all td's(3) in tr and zip it with t_header
        for td, th in zip(tr.find_all("td"), t_headers):
            t_row[th] = td.text.replace('\n', '').strip()
        table_data.append(t_row)

    # Put the data for the table with his heading.
    data[heading] = table_data

print("table_data")
But when I run this script, I get nothing. Please help me with this.
It seems that your list gdp_table_data[0].find_all("td") is empty (you are searching for <td> tags inside a <th> cell, which normally contains none), and that explains why you find nothing: your for loops never do anything. Without more context about your strategy, it is hard to help further.
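If you want to stay with BeautifulSoup, a minimal sketch along these lines could work. It assumes the page serves a single plain <table> whose header cells are <th> and whose data rows use <td>; I have not inspected the actual markup, so treat the selectors and the "Name" label as assumptions:

from bs4 import BeautifulSoup
import requests
import csv

url = "https://myaccount.umn.edu/lookup?SET_INSTITUTION=UMNTC&type=name&CN=University+of+Minnesota&campus=a&role=any"
soup = BeautifulSoup(requests.get(url).text, "lxml")

# Assumes the first <table> on the page is the results table
table = soup.find("table")

# Column labels taken from the <th> cells of the header row
headers = [th.get_text(strip=True) for th in table.find_all("th")]
name_idx = headers.index("Name")  # assumes a header cell literally labelled "Name"

names = []
for tr in table.find_all("tr"):
    cells = tr.find_all("td")
    if len(cells) > name_idx:  # skips the header row and any malformed rows
        names.append(cells[name_idx].get_text(strip=True))

with open("names.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["Name"])
    writer.writerows([n] for n in names)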
By the way, if you are not opposed to using an external library, pandas makes scraping this kind of page very easy. Just so you know:
>>> import pandas as pd
>>> url = "https://myaccount.umn.edu/lookup?SET_INSTITUTION=UMNTC&type=name&CN=University+of+Minnesota&campus=a&role=any"
>>> df = pd.read_html(url)[0]
>>> print(df)
Name Email Work Phone Phone Dept/College
0 AIESEC at the University of Minnesota (aiesec) aiesec@umn.edu NaN NaN Student Organization
1 Ayn Rand Study Group University of Minnesota (... aynrand@umn.edu NaN NaN NaN
2 Balance UMD (balance) balance@d.umn.edu NaN NaN Student Organization
3 Christians on Campus the University of Minneso... cocumn@umn.edu NaN NaN Student Organization
4 Climb Club University of Minnesota (climb) climb@umn.edu NaN NaN Student Organization
.. ... ... ... ... ...
74 University of Minnesota Tourism Center (tourism) tourism@umn.edu NaN NaN Department
75 University of Minnesota Treasury Accounting (t... treasury@umn.edu NaN NaN Department
76 University of Minnesota Twin Cities HOSA (umnh... umnhosa@umn.edu NaN NaN Student Organization
77 University of Minnesota U Write (uwrite) NaN NaN NaN Department
78 University of Minnesota VoiceMail (cs-vcml) cs-vcml@umn.edu NaN NaN OIT Network & Design
[79 rows x 5 columns]
Now, getting only the names is very simple:
>>> print(df.Name)
0 AIESEC at the University of Minnesota (aiesec)
1 Ayn Rand Study Group University of Minnesota (...
2 Balance UMD (balance)
3 Christians on Campus the University of Minneso...
4 Climb Club University of Minnesota (climb)
...
74 University of Minnesota Tourism Center (tourism)
75 University of Minnesota Treasury Accounting (t...
76 University of Minnesota Twin Cities HOSA (umnh...
77 University of Minnesota U Write (uwrite)
78 University of Minnesota VoiceMail (cs-vcml)
Name: Name, Length: 79, dtype: object
To export only that column to a .csv file, use:
>>> df[["Name"]].to_csv("./filename.csv")
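Note that by default to_csv also writes the DataFrame index as an extra first column; if you only want the names, pass index=False:
>>> df[["Name"]].to_csv("./filename.csv", index=False)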