BeautifulSoup to parse an HTML table

This is my first time using BeautifulSoup, and I am trying to parse an HTML table. So far, working from other examples, I have been able to write some simple code that gets very close to what I need. However, by using ele.text.strip(), I end up losing part of the information that I want to keep.

Here is what my code looks like now:

from bs4 import BeautifulSoup

soup = BeautifulSoup(open("data_table.htm"), "html.parser")

table = soup.find("div", id="CT_Main_1_divResults")
table_body = table.find('tbody')
rows = table_body.find_all('tr')

data = []
for row in rows:
    cols = row.find_all('td')
    cols = [ele.text.strip() for ele in cols]
    data.append(cols)

Results:

[u'$4,090,000,000',
 u'13.61%',
 u'4,550,000',
 u'100 Grainger Pkwy.',
 u'',
 u'',
 u'']

I thought maybe I could just drop the ele.text.strip() line and otherwise use the same code, as seen below:

data = []
for row in rows:
    cols = row.find_all('td')
    data.append(cols)

Here are the results that produces:

[<td><span style="text-align: right; height: 36px;">$4,090,000,000</span></td>,
 <td><span style="text-align: right; height: 36px;">13.61%</span></td>,
 <td><span style="text-align: right; height: 36px;">4,550,000</span></td>,
 <td class=""><span style="text-align: right; height: 36px;">100 Grainger Pkwy.</span></td>,
 <td><span style="text-align: right; height: 36px;"><img src="Inside%20the%20Databases.com_files/True.gif"/></span></td>,
 <td><span style="text-align: right; height: 36px;"><img src="Inside%20the%20Databases.com_files/cancel.gif"/></span></td>,
 <td class="tdbrdrright"><span style="text-align: right; height: 36px;"><img src="Inside%20the%20Databases.com_files/True.gif"/></span></td>]

One way around this might be to use the second option and do some fancy string parsing to grab what I need, but I hope there's a better way. In the end, I want the data to look like the list below. How can I adjust my code to achieve this?

[u'$4,090,000,000',
 u'13.61%',
 u'4,550,000',
 u'100 Grainger Pkwy.',
 u'Inside%20the%20Databases.com_files/True.gif',
 u'Inside%20the%20Databases.com_files/cancel.gif',
 u'Inside%20the%20Databases.com_files/True.gif']

import bs4

html = '''<td><span style="text-align: right; height: 36px;">$4,090,000,000</span></td>,
 <td><span style="text-align: right; height: 36px;">13.61%</span></td>,
 <td><span style="text-align: right; height: 36px;">4,550,000</span></td>,
 <td class=""><span style="text-align: right; height: 36px;">100 Grainger Pkwy.</span></td>,
 <td><span style="text-align: right; height: 36px;"><img src="Inside%20the%20Databases.com_files/True.gif"/></span></td>,
 <td><span style="text-align: right; height: 36px;"><img src="Inside%20the%20Databases.com_files/cancel.gif"/></span></td>,
 <td class="tdbrdrright"><span style="text-align: right; height: 36px;"><img src="Inside%20the%20Databases.com_files/True.gif"/></span></td>'''
soup = bs4.BeautifulSoup(html, 'lxml')

for td in soup('td'):
    if td.text:
        print(td.text)
    else:
        print(td.img.get('src'))

Output:

$4,090,000,000
13.61%
4,550,000
100 Grainger Pkwy.
Inside%20the%20Databases.com_files/True.gif
Inside%20the%20Databases.com_files/cancel.gif
Inside%20the%20Databases.com_files/True.gif

Change each print call to an append on a list, and you will get this output as a list.
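As a minimal sketch of that change, the loop can collect values into a list instead of printing them. The shortened HTML snippet, the `files/` paths, and the use of `html.parser` are assumptions made here to keep the example self-contained; they are not part of the original answer.

```python
from bs4 import BeautifulSoup

# Shortened, hypothetical stand-in for the table cells in the question
html = """
<table><tr>
<td><span>13.61%</span></td>
<td><span><img src="files/True.gif"/></span></td>
<td><span><img src="files/cancel.gif"/></span></td>
</tr></table>
"""

soup = BeautifulSoup(html, "html.parser")

data = []
for td in soup.find_all("td"):
    text = td.get_text(strip=True)
    if text:
        # The cell has visible text, so keep it
        data.append(text)
    else:
        # No text: fall back to the src of the first image, if there is one
        img = td.find("img")
        data.append(img["src"] if img else "")

print(data)
```

Guarding with `if img else ""` keeps the loop from failing on a cell that has neither text nor an image, which the original print version would not survive.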

The missing information is stored in the img tag's src attribute, not in the cell's text.
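A small illustration of that point, using a single hypothetical cell (the `files/True.gif` path is made up for the example): the text of the cell is empty, while the image path is available through the tag's attributes.

```python
from bs4 import BeautifulSoup

# One image-only cell, similar in shape to the ones in the question
cell = BeautifulSoup(
    '<td><span><img src="files/True.gif"/></span></td>', "html.parser"
).td

cell_text = cell.get_text(strip=True)  # text content of the cell
img_src = cell.img.get("src")          # attribute of the nested img tag

print(repr(cell_text))
print(img_src)
```

Here `cell.img` navigates to the first img descendant of the cell, and `.get("src")` returns None rather than raising if the attribute is missing.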

Give this a try. You'll need to adjust it depending on what you want to do if there are, say, multiple img tags in a cell, or text as well as img tags, but this should get you started down the right path.

from bs4 import BeautifulSoup

soup = BeautifulSoup(open("data-table.html"), 'html.parser')

table = soup.find("div", id="CT_Main_1_divResults")
table_body = table.find('tbody')
rows = table_body.find_all('tr')

data = []
for row in rows:
    cols = []
    for col in row.find_all('td'):
        t = col.text.strip()
        if not t:
            # Look for images inside this cell only, not the whole row
            for img in col.find_all('img'):
                t = img.attrs['src']

        cols.append(t)
    data.append(cols)

print(data)

Output:

[[u'$4,090,000,000', u'13.61%', u'4,550,000', u'100 Grainger Pkwy.', u'Inside%20the%20Databases.com_files/True.gif', u'Inside%20the%20Databases.com_files/cancel.gif', u'Inside%20the%20Databases.com_files/True.gif']]
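For the cases mentioned above (several img tags in one cell, or text alongside an image), one possible adjustment is to return a list of values per cell. This is only a sketch: the helper name `cell_values` and the sample HTML are hypothetical and not taken from the original page.

```python
from bs4 import BeautifulSoup


def cell_values(td):
    """Return the cell's stripped text (if any) followed by every img src in it."""
    values = []
    text = td.get_text(" ", strip=True)
    if text:
        values.append(text)
    # Collect every image in this cell, skipping img tags without a src
    values.extend(img["src"] for img in td.find_all("img") if img.has_attr("src"))
    return values


# Hypothetical row covering the three cases: text only, two images, text plus image
html = """
<table><tbody><tr>
<td>100 Grainger Pkwy.</td>
<td><img src="a.gif"/><img src="b.gif"/></td>
<td>Verified <img src="c.gif"/></td>
</tr></tbody></table>
"""

soup = BeautifulSoup(html, "html.parser")
row = [cell_values(td) for td in soup.find_all("td")]
print(row)
```

Returning a list per cell keeps all the information; whether to flatten it or keep only the first value depends on what the rest of your code expects.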
