
Saving all the columns from an HTML table using Beautiful Soup in Python

I have two kinds of rows from a website that I am trying to convert to a table.

The first one looks like this:

<tr id="eventRowId_750"> <td class="first left">All Day</td> <td class="flagCur left"><span class="ceFlags France float_lang_base_1" data-img_key="France" title="France"> </span></td> <td class="left textNum sentiment"><span class="bold">Holiday</span></td> <td class="left event" colspan="6">French - Flower Festival</td> </tr>

The second kind of row looks like this:

<tr class="js-event-item revised" data-event-datetime="2022/02/02 01:00:00" event_attr_id="114" id="eventRowId_444333"> <td class="first left time js-time" title="">01:00</td> <td class="left ImageCur noWrap"><span class="ceImages Australia" data-img_key="Australia" title="Australia"> </span> AUS</td> <td class="left textNum sentiment noWrap" data-img_key="bull1" title="Low Impact"><i class="grayFullBullishIcon"></i><i class="grayEmptyBullishIcon"></i><i class="grayEmptyBullishIcon"></i></td> <td class="left event" title="Click to view more info on Australian Budget"><a href="australian-budget-114" target="_blank">      Australian Budget  (Dec)</a> </td> <td class="bold act blackFont event-444333-actual" id="eventActual_444333" title="">-5M</td> <td class="fore event-444333-forecast" id="eventForecast_444333"> </td> <td class="prev greenFont event-444333-previous" id="eventPrevious_444333"><span title="Revised From -3M">-2M</span></td> <td class="alert js-injected-user-alert-container" data-event-id="114" data-name="Australian Budget" data-status-enabled="0"> <span class="js-plus-icon alertBellGrayPlus genToolTip oneliner" data-tooltip="Create Alert" data-tooltip-alt="Alert is active"></span> </td> </tr>

I am trying to convert them into rows using Python and Beautiful Soup. I use the following code:

for items in soup.select("tr"):
    data = [item.get_text(strip=True) for item in items.select("th,td")]
    print(data)

But my output looks like this:

['All Day', '', 'Holiday', 'French - Flower Festival']
['01:00', 'AUS', '', 'Australian Budget  (Dec)', '-5M', '', '-2M', '']

How can I get the "Low Impact" text into the third column of the second row (where "Holiday" appears in the first row), and save the name "France" into the second column of the first row, so that the output looks like this?

['All Day', 'France', 'Holiday', 'French - Flower Festival']
['01:00', 'AUS', 'Low Impact', 'Australian Budget  (Dec)', '-5M', '', '-2M', '']

This part is not really important, but is it possible to save the span title, if it exists, by appending it to the end of the list? I mean the part that says "Revised From -3M". The output could then look like this:

['All Day', '', 'Holiday', 'French - Flower Festival']
['01:00', 'AUS', 'Low Impact', 'Australian Budget  (Dec)', '-5M', '', '-2M', '', "Revised From -3M"]

It is unlikely that there is a clean pattern here, so here we go. I couldn't think of any way to get the title other than a regex, because it is not tied to one specific tag.

from bs4 import BeautifulSoup
import re

with open("example.html") as html_doc:
    soup = BeautifulSoup(html_doc, "html.parser")

for items in soup.select("tr"):
    row = []
    for item in items.select("th,td"):
        text = item.get_text(strip=True)
        if not text:
            title = re.search(r"title=\"(.*?)\"", str(item))
            if title:
                text = title.group(1)
        row.append(text)
    print(row)
Output:

['All Day', 'France', 'Holiday', 'French - Flower Festival']
['01:00', 'AUS', 'Low Impact', 'Australian Budget  (Dec)', '-5M', '', '-2M', '']
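
If you would rather avoid the regex, one possible alternative is to read the `title` attribute through Beautiful Soup itself. The sketch below is only an illustration: `HTML` is a trimmed-down copy of the two rows from the question, and `cell_text` and `parse_rows` are hypothetical helper names, not part of the answer above. It assumes an empty cell should fall back to its own `title`, or failing that, to the `title` of the first descendant that has one.

```python
from bs4 import BeautifulSoup

# Trimmed-down copies of the two rows from the question (sample input only).
HTML = """
<table>
<tr id="eventRowId_750">
  <td class="first left">All Day</td>
  <td class="flagCur left"><span class="ceFlags France" title="France"> </span></td>
  <td class="left textNum sentiment"><span class="bold">Holiday</span></td>
  <td class="left event" colspan="6">French - Flower Festival</td>
</tr>
<tr class="js-event-item revised" id="eventRowId_444333">
  <td class="first left time js-time" title="">01:00</td>
  <td class="left ImageCur noWrap"><span class="ceImages Australia" title="Australia"> </span> AUS</td>
  <td class="left textNum sentiment noWrap" title="Low Impact"><i class="grayFullBullishIcon"></i></td>
  <td class="left event"><a href="australian-budget-114">Australian Budget (Dec)</a></td>
  <td class="bold act" title="">-5M</td>
  <td class="fore"> </td>
  <td class="prev greenFont"><span title="Revised From -3M">-2M</span></td>
  <td class="alert"><span class="js-plus-icon" data-tooltip="Create Alert"></span></td>
</tr>
</table>
"""


def cell_text(cell):
    """Return the cell's visible text, or a title attribute if the text is empty."""
    text = cell.get_text(strip=True)
    if text:
        return text
    # Check the cell itself first, then any descendant that carries a title.
    holder = cell if cell.get("title") else cell.find(attrs={"title": True})
    return holder["title"] if holder else ""


def parse_rows(html):
    """Convert every <tr> in the markup into a list of cell strings."""
    soup = BeautifulSoup(html, "html.parser")
    return [[cell_text(c) for c in tr.select("th,td")] for tr in soup.select("tr")]


for row in parse_rows(HTML):
    print(row)
```

Working on parsed attributes rather than on `str(item)` means the lookup cannot accidentally match text that merely contains `title="..."`, although it is still a heuristic that depends on the page layout staying the same.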

I believe the closest you can get (assuming the pattern holds across all your rows) is something like this:

for items in soup.select("tr"):
    # Visible text of every cell, followed by the title of every titled <span> in the row.
    row = [item.text.strip() for item in items.select("td")]
    row += [item["title"] for item in items.select("span[title]")]
    print(row)

Output:

['All Day', '', 'Holiday', 'French - Flower Festival', 'France']
['01:00', 'AUS', '', 'Australian Budget  (Dec)', '-5M', '', '-2M', '', 'Australia', 'Revised From -3M']

Obviously, you will need to manipulate the rows to exclude unwanted elements. For example, to remove empty elements, you can change the last line to read:

print([element for element in row if element.strip()])

which will change the output to:

['All Day', 'Holiday', 'French - Flower Festival', 'France']
['01:00', 'AUS', 'Australian Budget  (Dec)', '-5M', '-2M', 'Australia', 'Revised From -3M']
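
To get closer to the layout asked for in the question (titles filling the empty cells, with "Revised From -3M" appended at the end), the two ideas can be combined. The following is a rough sketch, not a definitive solution: `SAMPLE` is a simplified stand-in for the question's rows, `row_to_list` is a hypothetical helper name, and the rule "a span that has both visible text and a title is an annotation worth appending" is an assumption that happens to hold for these two rows.

```python
from bs4 import BeautifulSoup

# Simplified stand-in for the two rows in the question (sample input only).
SAMPLE = """
<table>
<tr>
  <td>All Day</td>
  <td><span title="France"> </span></td>
  <td><span class="bold">Holiday</span></td>
  <td colspan="6">French - Flower Festival</td>
</tr>
<tr>
  <td title="">01:00</td>
  <td><span title="Australia"> </span> AUS</td>
  <td title="Low Impact"><i></i></td>
  <td><a href="australian-budget-114">Australian Budget (Dec)</a></td>
  <td>-5M</td>
  <td> </td>
  <td><span title="Revised From -3M">-2M</span></td>
  <td><span data-tooltip="Create Alert"></span></td>
</tr>
</table>
"""


def row_to_list(tr):
    """Cell texts (falling back to a title when empty), then annotation titles."""
    row = []
    for cell in tr.select("th,td"):
        text = cell.get_text(strip=True)
        if not text:
            # Fall back to the cell's own title, else the first titled descendant.
            holder = cell if cell.get("title") else cell.select_one("[title]")
            text = holder["title"] if holder else ""
        row.append(text)
    # Assumption: a span with visible text AND a non-empty title is an annotation
    # (e.g. "Revised From -3M"); blank flag spans are skipped here.
    extras = [
        span["title"]
        for span in tr.select("span[title]")
        if span.get_text(strip=True) and span["title"]
    ]
    return row + extras


soup = BeautifulSoup(SAMPLE, "html.parser")
rows = [row_to_list(tr) for tr in soup.select("tr")]
for row in rows:
    print(row)
```

Because the annotation is appended only when present, rows can end up with different lengths, so you may want to pad them before writing them to a table.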


 