
Saving all the columns from an HTML table using Beautiful Soup in Python

I am scraping a website and have two kinds of table rows that I want to convert into a table.

The first one looks like this,

<tr id="eventRowId_750"> <td class="first left">All Day</td> <td class="flagCur left"><span class="ceFlags France float_lang_base_1" data-img_key="France" title="France"> </span></td> <td class="left textNum sentiment"><span class="bold">Holiday</span></td> <td class="left event" colspan="6">French - Flower Festival</td> </tr>

The second kind of row looks like this,

<tr class="js-event-item revised" data-event-datetime="2022/02/02 01:00:00" event_attr_id="114" id="eventRowId_444333"> <td class="first left time js-time" title="">01:00</td> <td class="left ImageCur noWrap"><span class="ceImages Australia" data-img_key="Australia" title="Australia"> </span> AUS</td> <td class="left textNum sentiment noWrap" data-img_key="bull1" title="Low Impact"><i class="grayFullBullishIcon"></i><i class="grayEmptyBullishIcon"></i><i class="grayEmptyBullishIcon"></i></td> <td class="left event" title="Click to view more info on Australian Budget"><a href="australian-budget-114" target="_blank">      Australian Budget  (Dec)</a> </td> <td class="bold act blackFont event-444333-actual" id="eventActual_444333" title="">-5M</td> <td class="fore event-444333-forecast" id="eventForecast_444333"> </td> <td class="prev greenFont event-444333-previous" id="eventPrevious_444333"><span title="Revised From -3M">-2M</span></td> <td class="alert js-injected-user-alert-container" data-event-id="114" data-name="Australian Budget" data-status-enabled="0"> <span class="js-plus-icon alertBellGrayPlus genToolTip oneliner" data-tooltip="Create Alert" data-tooltip-alt="Alert is active"></span> </td> </tr>

I am trying to convert them into rows using Python and BeautifulSoup. I use the following code,

for items in soup.select("tr"):
    data = [item.get_text(strip=True) for item in items.select("th,td")]
    print(data)

But my output looks like this,

['All Day', '', 'Holiday', 'French - Flower Festival']
['01:00', 'AUS', '', 'Australian Budget  (Dec)', '-5M', '', '-2M', '']

How can I get the "Low Impact" text into the third column of the second row (where "Holiday" appears in the first row), and put the name "France" into the second column of the first row, so that the output looks like this?

['All Day', 'France', 'Holiday', 'French - Flower Festival']
['01:00', 'AUS', 'Low Impact', 'Australian Budget  (Dec)', '-5M', '', '-2M', '']

This part is not really important, but is it possible to save the span title, if it exists, by adding it to the end of the list? I mean the part that says "Revised From -3M". It could look like this,

['All Day', '', 'Holiday', 'French - Flower Festival']
['01:00', 'AUS', 'Low Impact', 'Australian Budget  (Dec)', '-5M', '', '-2M', '', 'Revised From -3M']

There is no single clean pattern here, so this is a workaround. The title is not tied to one specific tag (sometimes it sits on the td itself, sometimes on a nested span), so the only thing I could think of was a regex over the cell's HTML.

from bs4 import BeautifulSoup
import re

with open("example.html") as html_doc:
    soup = BeautifulSoup(html_doc, "html.parser")

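for items in soup.select("tr"):
    row = []
    for item in items.select("th,td"):
        text = item.get_text(strip=True)
        if not text:
            # No visible text: fall back to the first non-greedy title="..."
            # match in the cell's HTML (on the cell itself or on a nested tag).
            title = re.search(r"title=\"(.*?)\"", str(item))
            if title:
                text = title.group(1)
        row.append(text)
    print(row)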
# output
['All Day', 'France', 'Holiday', 'French - Flower Festival']
['01:00', 'AUS', 'Low Impact', 'Australian Budget  (Dec)', '-5M', '', '-2M', '']
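
If you would rather not run a regex over the serialized HTML, the same fallback can be sketched with attribute lookups only. This is a minimal sketch, not a definitive implementation: the HTML below is a trimmed copy of the two rows from the question, and the helper names cell_text and extract_rows are made up for illustration. It assumes a cell with no visible text carries the wanted label either in its own title attribute or in the title of its first titled descendant.

```python
from bs4 import BeautifulSoup

# Trimmed-down copies of the two rows from the question (assumption:
# the real page has the same shape).
HTML = """
<table>
<tr id="eventRowId_750">
  <td class="first left">All Day</td>
  <td class="flagCur left"><span class="ceFlags France" title="France"> </span></td>
  <td class="left textNum sentiment"><span class="bold">Holiday</span></td>
  <td class="left event" colspan="6">French - Flower Festival</td>
</tr>
<tr class="js-event-item revised" id="eventRowId_444333">
  <td class="first left time js-time" title="">01:00</td>
  <td class="left ImageCur noWrap"><span class="ceImages Australia" title="Australia"> </span> AUS</td>
  <td class="left textNum sentiment noWrap" title="Low Impact"><i class="grayFullBullishIcon"></i></td>
  <td class="left event" title="Click to view more info on Australian Budget"><a href="australian-budget-114">Australian Budget  (Dec)</a></td>
  <td class="bold act blackFont" title="">-5M</td>
  <td class="fore"> </td>
  <td class="prev greenFont"><span title="Revised From -3M">-2M</span></td>
  <td class="alert"><span class="js-plus-icon" data-tooltip="Create Alert"></span></td>
</tr>
</table>
"""


def cell_text(cell):
    """Return the visible text, or a title attribute if the text is empty."""
    text = cell.get_text(strip=True)
    if text:
        return text
    # Check the cell's own title first, then the first descendant with a title.
    if cell.get("title"):
        return cell["title"]
    titled = cell.find(attrs={"title": True})
    return titled["title"] if titled is not None else ""


def extract_rows(html):
    """Parse the HTML and return one list of cell strings per table row."""
    soup = BeautifulSoup(html, "html.parser")
    return [[cell_text(c) for c in tr.select("th,td")] for tr in soup.select("tr")]


rows = extract_rows(HTML)
for row in rows:
    print(row)
```

Unlike the regex, this cannot accidentally match attributes whose name merely ends in title (for example a data-title attribute), because it only looks at the real title attribute.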

I believe the closest you can get (assuming the pattern holds across all your rows) is something like this:

for items in soup.select("tr"):
    # Cell texts first, then the titles of every span that has one.
    row = ([item.text.strip() for item in items.select("td")]
           + [item["title"] for item in items.select("span[title]")])
    print(row)

Output:

['All Day', '', 'Holiday', 'French - Flower Festival', 'France']
['01:00', 'AUS', '', 'Australian Budget  (Dec)', '-5M', '', '-2M', '', 'Australia', 'Revised From -3M']

You will then need to post-process each row to exclude unwanted elements. For example, to remove empty elements, you can change the last line to read:

print([element for element in row if element.strip()])

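which will change the output to the following (note that dropping the empty strings shifts the remaining values left, so columns no longer line up across rows):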

['All Day', 'Holiday', 'French - Flower Festival', 'France']
['01:00', 'AUS', 'Australian Budget  (Dec)', '-5M', '-2M', 'Australia', 'Revised From -3M']
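
To keep the columns aligned and still append the "Revised From -3M" note at the end, as the question asks, one possible approach is sketched below. It is only a sketch under assumptions: the HTML is a trimmed copy of the rows from the question, row_with_notes is a made-up helper name, and the rule "a titled span that wraps visible text is a note" is a guess that fits these two rows and may not hold for the whole page.

```python
from bs4 import BeautifulSoup

# Trimmed-down copies of the two rows from the question (assumption:
# the real page has the same shape).
HTML = """
<table>
<tr id="eventRowId_750">
  <td class="first left">All Day</td>
  <td class="flagCur left"><span class="ceFlags France" title="France"> </span></td>
  <td class="left textNum sentiment"><span class="bold">Holiday</span></td>
  <td class="left event" colspan="6">French - Flower Festival</td>
</tr>
<tr class="js-event-item revised" id="eventRowId_444333">
  <td class="first left time js-time" title="">01:00</td>
  <td class="left ImageCur noWrap"><span class="ceImages Australia" title="Australia"> </span> AUS</td>
  <td class="left textNum sentiment noWrap" title="Low Impact"><i class="grayFullBullishIcon"></i></td>
  <td class="left event"><a href="australian-budget-114">Australian Budget  (Dec)</a></td>
  <td class="bold act blackFont" title="">-5M</td>
  <td class="fore"> </td>
  <td class="prev greenFont"><span title="Revised From -3M">-2M</span></td>
  <td class="alert"><span class="js-plus-icon" data-tooltip="Create Alert"></span></td>
</tr>
</table>
"""


def row_with_notes(tr):
    """Return the cell values in column order, followed by any span notes."""
    cells, notes = [], []
    for cell in tr.select("th,td"):
        text = cell.get_text(strip=True)
        span = cell.select_one("span[title]")
        if text:
            cells.append(text)
            # Hypothetical rule: a titled span that wraps visible text
            # (like the revised "previous" value) is extra information.
            if span is not None and span.get_text(strip=True):
                notes.append(span["title"])
        else:
            # Empty cell: use its own title, else the nested span's title.
            fallback = cell.get("title") or (span["title"] if span is not None else "")
            cells.append(fallback)
    return cells + notes


soup = BeautifulSoup(HTML, "html.parser")
result = [row_with_notes(tr) for tr in soup.select("tr")]
for row in result:
    print(row)
```

Because empty cells are kept as '' rather than removed, every value stays in its original column, and the notes only ever extend the row on the right.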

The technical posts on this site are licensed under CC BY-SA 4.0. If you reprint them, please cite the site URL or the original address.
