Scraping with BeautifulSoup: Scraping a specific column in a table, from an HTML page
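I am trying to use Python to get the data under the second column of the Shakemap site, for the codes that start with "CATAC2021" followed by four letters, written here as "aaaa" (e.g. aaaa, aaab, etc.). These are the IDs of the events.
I tried the code below to access the second column of the table and retrieve the ID data from the rows, but so far I have had no success. Does anyone know where I am going wrong and how to fix it?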
from bs4 import BeautifulSoup
from urllib import request

page = request.urlopen('http://shakemapcam.ethz.ch/archive/').read()
soup = BeautifulSoup(page)
desired_table = soup.findAll('table')[2]

# Find the columns you want data from
headers = desired_table.findAll('th')
desired_columns = []
for th in headers:
    if 'CATAC2021' in th.string:
        desired_columns.append([headers.index(th), th.getText()])

# Iterate through each row grabbing the data from the desired columns
rows = desired_table.findAll('tr')
for row in rows[1:]:
    cells = row.findAll('td')
    row_name = row.findNext('th').getText()
    for column in desired_columns:
        print(cells[column[0]].text, row_name, column[1])
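For comparison, here is a minimal BeautifulSoup sketch that reads the IDs from the data cells (td) of the second column instead of searching the header cells (th). It runs against a small inline HTML sample, not the live page: the sample table, its "#" first column and the variable names are assumptions made for illustration, and the real page markup may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the archive table; the real markup may differ.
html = """
<table>
  <tr><th>#</th><th>Name/Epicenter</th></tr>
  <tr><td>1</td><td>CATAC2021efod / 6.354496002 / -76.18144226</td></tr>
  <tr><td>2</td><td>ineterloc2018aatd / 16.59416389 / -86.35289764</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find_all("table")[0]

code = "CATAC2021"
event_ids = []
for row in table.find_all("tr")[1:]:  # skip the header row
    cells = row.find_all("td")
    if len(cells) < 2:
        continue  # not a data row
    # Second column: keep only the text before the first "/"
    name = cells[1].get_text(strip=True).split("/")[0].strip()
    if name.startswith(code):
        event_ids.append(name[len(code):])

print(event_ids)  # ['efod']
```

On the real page, the table index and the column position would need to be checked against the actual HTML.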
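I would use pandas here to grab the table, then use a regular expression to extract the pattern (the part after the four digits and before the first /). Note, though, that there is an Event ID column, so make sure you know the difference. I named the new column eventId.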
import pandas as pd

url = 'http://shakemapcam.ethz.ch/archive/'
# read_html returns one DataFrame per <table>; take the last one
df = pd.read_html(url, header=0)[-1]
# Group 0 is the text before the four-digit year, group 1 the letters after it
parts = df['Name/Epicenter'].str.extract(r'(.*)\d{4}(.*)(\s//?.*)(//?.*)')
df['prefix'] = parts[0]
df['eventId'] = parts[1]
Output:
print(df[['Name/Epicenter','prefix','eventId']])
      Name/Epicenter                                  prefix     eventId
0     CATAC2021efod / 6.354496002 / -76.18144226      CATAC      efod
1     CATAC2021edxe / 15.67289066 / -93.40866852      CATAC      edxe
2     CATAC2021ebzg / 9.406171799 / -84.55581665      CATAC      ebzg
3     CATAC2021eayx / 14.03658199 / -92.30122375      CATAC      eayx
4     CATAC2021eayx / 14.03546429 / -92.30183411      CATAC      eayx
...   ...                                             ...        ...
1574  ineterloc2018acor / 12.21397209 / -86.7282486   ineterloc  acor
1575  ineterloc2018acor / 12.21113586 / -86.73029327  ineterloc  acor
1576  ineterloc2018acor / 12.20839691 / -86.73122406  ineterloc  acor
1577  ineterloc2018aatd / 16.59416389 / -86.35289764  ineterloc  aatd
1578  ineterloc2018aatd / 16.64553833 / -86.26078796  ineterloc  aatd

[1579 rows x 3 columns]
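To check what the pattern captures without hitting the site, here is a small sketch that applies the same regular expression with the standard re module to two values copied from the output above. The helper name split_name is made up for this example.

```python
import re

# Same pattern as in the pandas call: text before the four-digit year,
# text after it, then the two "/"-separated coordinates.
PATTERN = re.compile(r'(.*)\d{4}(.*)(\s//?.*)(//?.*)')

def split_name(name):
    """Return (prefix, event_id) for a Name/Epicenter value, or None if it does not match."""
    match = PATTERN.match(name)
    if match is None:
        return None
    return match.group(1), match.group(2)

samples = [
    "CATAC2021efod / 6.354496002 / -76.18144226",
    "ineterloc2018aatd / 16.59416389 / -86.35289764",
]
for name in samples:
    print(split_name(name))
# ('CATAC', 'efod')
# ('ineterloc', 'aatd')
```

The last two groups only match when two "/" separators follow the ID, so a value without both coordinates returns None here (and NaN in str.extract).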