Scraping strings from HTML using python3-beautifulsoup3

Question

I'm trying to get strings from table rows using BeautifulSoup. The strings I want are 'SANDAL' and 'SHORTS', from the second and third rows. I know this can be solved with regular expressions or string functions, but I want to learn BeautifulSoup and do as much as possible with it.

Clipped Python code

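    from bs4 import BeautifulSoup

    soup = BeautifulSoup(page, 'html.parser')  # page is assumed to hold the HTML shown below
    table = soup.find('table')
    row = table.find_next('tr')  # first <tr>: the header row
    row = row.find_next('tr')    # second <tr>: the SANDAL row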

HTML

    <html>
    <body>
    <div id="body">
    <div class="data">
    
    <table id="products">
    
    <tr><td>PRODUCT<td class="ole1">ID<td class="c1">TYPE<td class="ole1">WHEN<td class="ole4">ID<td class="ole4">ID</td></tr>
    <tr><td>SANDAL<td class="ole1">77313<td class="ole1">wear<td class="ole1">new<td class="ole4">id<td class="ole4">878717</td></tr>
    <tr><td>SHORTS<td class="ole1">77314<td class="ole1">wear<td class="ole1">new<td class="ole4">id<td class="ole4">878718</td></tr>
    
    </table>
    
    </div>
    </div>
    </body>
    </html>

Answer 1

To get the text from the first column of the table (without the header row), you can use this script:

    from bs4 import BeautifulSoup


    txt = '''
        <html>
        <body>
        <div id="body">
        <div class="data">

        <table id="products">

        <tr><td>PRODUCT<td class="ole1">ID<td class="c1">TYPE<td class="ole1">WHEN<td class="ole4">ID<td class="ole4">ID</td></tr>
        <tr><td>SANDAL<td class="ole1">77313<td class="ole1">wear<td class="ole1">new<td class="ole4">id<td class="ole4">878717</td></tr>
        <tr><td>SHORTS<td class="ole1">77314<td class="ole1">wear<td class="ole1">new<td class="ole4">id<td class="ole4">878718</td></tr>

        </table>

        </div>
        </div>
        </body>
        </html>'''

    soup = BeautifulSoup(txt, 'lxml')  # <-- lxml is important here (to parse the HTML code correctly)

    for tr in soup.find('table', id='products').find_all('tr')[1:]:  # <-- [1:] because we want to skip the header
        print(tr.td.text)                                            # <-- print contents of first <td> tag

Prints:

    SANDAL
    SHORTS
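
If lxml is not installed, a variation that does not depend on the parser may help. As far as I can tell, the built-in 'html.parser' does not close the unclosed <td> tags, so later cells end up nested inside the first one and tr.td.text would return the whole row. The sketch below is only one possible workaround, not part of the original answer: it takes the first text node of the first <td>. The helper name first_column and the shortened table are made up for illustration.

```python
from bs4 import BeautifulSoup

# Shortened copy of the table from the question (cells are left unclosed).
html = '''
<table id="products">
<tr><td>PRODUCT<td class="ole1">ID<td class="c1">TYPE</td></tr>
<tr><td>SANDAL<td class="ole1">77313<td class="ole1">wear</td></tr>
<tr><td>SHORTS<td class="ole1">77314<td class="ole1">wear</td></tr>
</table>'''


def first_column(markup, parser="html.parser"):
    """Hypothetical helper: first-cell text of every row after the header."""
    soup = BeautifulSoup(markup, parser)
    rows = soup.find("table", id="products").find_all("tr")[1:]  # skip header
    # find(string=True) returns the first text node inside the first <td>,
    # so it does not matter whether later cells were nested inside it.
    return [str(row.td.find(string=True)).strip() for row in rows]


names = first_column(html)
print(names)
```

Passing parser="lxml" to the helper should give the same list, since the first text node of the first cell is the same either way.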

Answer 1
1 accepted 2020-07-18 23:51:04
