Scraping with BeautifulSoup: scraping a specific column of a table from an HTML page

I'm trying to get hold of the data in the second column of the table on the Shakemap site using Python, specifically the entries of the form "CATAC2021aaaa", where "aaaa" stands for the four letters that follow the code (e.g. aaaa, aaab, etc.). These are the IDs of the events.

I have tried the code below to access the second column of the table and retrieve the ID from each row, but I have had no success so far. Does anyone know where I have gone wrong and how to correct it?

from bs4 import BeautifulSoup
from urllib import request

page = request.urlopen('http://shakemapcam.ethz.ch/archive/').read()
soup = BeautifulSoup(page)

desired_table = soup.findAll('table')[2]

# Find the columns you want data from
headers = desired_table.findAll('th')
desired_columns = []
for th in headers:
    if 'CATAC2021' in th.string:
        desired_columns.append([headers.index(th), th.getText()])

# Iterate through each row grabbing the data from the desired columns
rows = desired_table.findAll('tr')

for row in rows[1:]:
    cells = row.findAll('td')
    row_name = row.findNext('th').getText()
    for column in desired_columns:
        print(cells[column[0]].text, row_name, column[1])
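One likely reason the attempt above prints nothing: it looks for 'CATAC2021' in the header cells (th), whereas the IDs appear to sit in the data cells (td) of the second column, so desired_columns stays empty. Here is a minimal BeautifulSoup sketch of reading that column directly. The inline HTML is a made-up stand-in for the live page (the real markup and column order may differ), and second_column_ids is a hypothetical helper name.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical stand-in for the archive page; the real layout is an assumption.
SAMPLE_HTML = """
<table>
  <tr><th>Event ID</th><th>Name/Epicenter</th></tr>
  <tr><td>x1</td><td>CATAC2021efod / 6.354496002 / -76.18144226</td></tr>
  <tr><td>x2</td><td>ineterloc2018acor / 12.21397209 / -86.7282486</td></tr>
</table>
"""


def second_column_ids(html, prefix="CATAC2021"):
    """Return the four-letter suffixes found after `prefix` in column two."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table")
    ids = []
    for row in table.find_all("tr"):
        cells = row.find_all("td")
        if len(cells) < 2:
            continue  # skip rows without data cells, such as the header row
        name = cells[1].get_text(strip=True)
        # The cell must start with the prefix, followed by exactly four letters.
        match = re.match(re.escape(prefix) + r"([a-z]{4})\b", name)
        if match:
            ids.append(match.group(1))
    return ids


print(second_column_ids(SAMPLE_HTML))
```

Against the live page you would pass the downloaded HTML instead of SAMPLE_HTML and pick the right table first, for example by index as in the question.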

I'd use pandas here to grab the table, then use a regex to pull out the pattern (the part that follows the four-digit year and precedes the first /). Note, though, that there is also an Event ID column, so be sure you know the difference. I named the new column eventId.

import pandas as pd

url = 'http://shakemapcam.ethz.ch/archive/'
# read_html returns every table on the page; take the last one
df = pd.read_html(url, header=0)[-1]
# Split e.g. "CATAC2021efod / ..." into the prefix "CATAC" and the ID "efod"
parts = df['Name/Epicenter'].str.extract(r'(.*)\d{4}(.*)(\s//?.*)(//?.*)')
df['prefix'] = parts[0]
df['eventId'] = parts[1]

Output:

print(df[['Name/Epicenter','prefix','eventId']])
                                      Name/Epicenter     prefix eventId
0         CATAC2021efod / 6.354496002 / -76.18144226      CATAC    efod
1         CATAC2021edxe / 15.67289066 / -93.40866852      CATAC    edxe
2         CATAC2021ebzg / 9.406171799 / -84.55581665      CATAC    ebzg
3         CATAC2021eayx / 14.03658199 / -92.30122375      CATAC    eayx
4         CATAC2021eayx / 14.03546429 / -92.30183411      CATAC    eayx
                                             ...        ...     ...
1574   ineterloc2018acor / 12.21397209 / -86.7282486  ineterloc    acor
1575  ineterloc2018acor / 12.21113586 / -86.73029327  ineterloc    acor
1576  ineterloc2018acor / 12.20839691 / -86.73122406  ineterloc    acor
1577  ineterloc2018aatd / 16.59416389 / -86.35289764  ineterloc    aatd
1578  ineterloc2018aatd / 16.64553833 / -86.26078796  ineterloc    aatd

[1579 rows x 3 columns]
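To see what the pattern does without hitting the network, here is a small standard-library sketch run on a few of the names from the output above. It uses a simplified variant of the regex with named groups, not the exact expression from the answer, and split_name is a hypothetical helper name.

```python
import re

# Simplified variant: text before the year, the four-digit year, then
# everything up to the first space or slash.
NAME_PATTERN = re.compile(r"^(?P<prefix>.*?)(?P<year>\d{4})(?P<event_id>[^\s/]+)")


def split_name(name):
    """Split e.g. 'CATAC2021efod / ...' into (prefix, year, event_id)."""
    match = NAME_PATTERN.match(name)
    if match is None:
        return None  # the name does not follow the expected layout
    return match.group("prefix"), match.group("year"), match.group("event_id")


samples = [
    "CATAC2021efod / 6.354496002 / -76.18144226",
    "ineterloc2018aatd / 16.59416389 / -86.35289764",
]
for sample in samples:
    print(split_name(sample))
```

The same pattern can be passed to df['Name/Epicenter'].str.extract, which returns one column per named group.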
