
How to parse this? Trying to pull data from non-HTML webpage using BeautifulSoup and Python

BeautifulSoup & HTML novice here, and I've never seen this type of page before. I'm trying to pull data from the 2008 presidential race in Dane County, Wisconsin.

Link: https://www.countyofdane.com/clerk/elect2008d.html

The data for the presidential race is in what appears to be a hard-coded table. It isn't stored between HTML tags or in any structure I've come across before.

Can I pull the data by iterating through the <!-- #--> thing somehow? Should I save the page as an HTML file and add a body tag around the table so it's easier to parse?

This problem actually comes down to text parsing, since the tables are plain text inside a pre element.

Here is what you can start with. The idea is to detect the beginning and end of each table by using the ----- header lines and the empty line that follows each table. Something along these lines:

import re

from bs4 import BeautifulSoup
import requests
from pprint import pprint

url = "https://www.countyofdane.com/clerk/elect2008d.html"
response = requests.get(url)

soup = BeautifulSoup(response.content, "html.parser")

is_table_row = False

tables = []
for line in soup.pre.get_text().splitlines():
    # beginning of the table
    if not is_table_row and "-----" in line:
        is_table_row = True
        table = []
        continue

    # end of the table
    if is_table_row and not line.strip():
        is_table_row = False
        tables.append(table)
        continue

    if is_table_row:
        table.append(re.split(r"\s{2,}", line))  # splitting by 2 or more spaces

pprint(tables)

This would print a list of lists, with one sublist of data rows for every table:

[
    [
        ['0001 T ALBION WDS 1-2', '753', '315', '2', '4', '1', '0', '5', '2', '0', '1'],
        ['0002 T BERRY WDS 1-2', '478', '276', '0', '0', '0', '0', '2', '0', '0', '1'],
        ...
        ['', 'CANDIDATE TOTALS', '205984', '73065', '435', '983', '103', '20', '1491', '316', '31', '511'],
        ['', 'CANDIDATE PERCENT', '72.80', '25.82', '.15', '.34', '.03', '.52', '.11', '.01', '.18']],
    [
        ['0001 T ALBION WDS 1-2', '726', '323', '0'],
        ['0002 T BERRY WDS 1-2', '457', '290', '1'],
        ['0003 T BLACK EARTH', '180', '107', '0'],
        ...
    ],
    ...
]

This, of course, does not include the table names and diagonal headers, which can be challenging to get but not impossible. You would also probably want to separate the total rows from the other data rows of a table. In any case, I think this can be a good starting example for you.
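
As a rough sketch of that last step, the total rows could be separated from the per-ward rows like this. It assumes, based only on the sample output above, that summary rows begin with an empty string (because those lines are indented) and that every other cell in a ward row is a whole number. The helper names split_totals and to_counts are made up for illustration:

```python
def split_totals(table):
    """Separate per-ward rows from summary rows in one parsed table."""
    wards, totals = [], []
    for row in table:
        # Assumption: summary rows start with an empty cell because the
        # source line is indented, e.g. ['', 'CANDIDATE TOTALS', '205984', ...].
        if row and row[0] == "":
            totals.append(row[1:])
        else:
            wards.append(row)
    return wards, totals


def to_counts(row):
    """Turn ['0001 T ALBION WDS 1-2', '753', '315'] into (label, [753, 315])."""
    label, *values = row
    return label, [int(v) for v in values]


# Sample rows shaped like the output above, shortened to a few columns.
sample = [
    ['0001 T ALBION WDS 1-2', '753', '315', '2'],
    ['0002 T BERRY WDS 1-2', '478', '276', '0'],
    ['', 'CANDIDATE TOTALS', '205984', '73065', '435'],
]

wards, totals = split_totals(sample)
ward_counts = [to_counts(row) for row in wards]
```

The total rows are left as strings here because the percent row holds values like '.15' that int() would reject; those would need float() instead.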
