How to extract tags from HTML content using BeautifulSoup

Question

I have content similar to the HTML below, and I am trying to parse/extract some text from it.

<div class="row d-3">
    <div class="col-16 col-sm-8">
        <strong>Category</strong> <br>
        // *** extract this text ***
        Clothing</div>
    <div class="col-16 col-sm-8">
        <strong>Sub-category</strong> <br>
          // *** extract this text ***
         this is Sub-category
        </div>
    <div class="col-16 col-sm-8">
        <strong>product</strong> <br>
        // *** extract this text ***
        This is the actual product </div>
</div>

I need the following:

{Category: Clothing, Sub-category: this is Sub-category, product: This is the actual product}

I tried the following:

for b in soup.find_all("div", class_="row d-3"):
  print(b.strong.get_text())

But this only extracts Category, not Clothing.

Answer 1

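How to achieve this?

You can use contents, or the following solution based on stripped_strings, which is a generator that yields each text string with surrounding whitespace removed: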

list(b.stripped_strings)

#Output --> ['Category', 'Clothing', 'Sub-category', 'this is Sub-category', 'product', 'This is the actual product']

To convert this result into a dict, with s being the list of strings above, you can use:

dict(zip(s[::2], s[1::2]))
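The slicing can be checked on a plain list without any parsing. This is a minimal sketch: the values are copied from the output above, the helper name pairs_to_dict is hypothetical, and it assumes every label is followed by exactly one value.

```python
def pairs_to_dict(strings):
    """Pair even-indexed labels with the odd-indexed values that follow them."""
    # zip stops at the shorter slice, so a trailing label with no value is dropped
    return dict(zip(strings[::2], strings[1::2]))


s = ['Category', 'Clothing', 'Sub-category', 'this is Sub-category',
     'product', 'This is the actual product']
result = pairs_to_dict(s)
print(result)
```

Passing zip straight to dict keeps the pairs in document order, which a set of tuples would not guarantee.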

Example:

from bs4 import BeautifulSoup

html = '''
<div class="row d-3">
    <div class="col-16 col-sm-8">
        <strong>Category</strong> <br>
        Clothing</div>
    <div class="col-16 col-sm-8">
        <strong>Sub-category</strong> <br>
         this is Sub-category
        </div>
    <div class="col-16 col-sm-8">
        <strong>product</strong> <br>
        This is the actual product </div>
</div>'''

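soup = BeautifulSoup(html, "lxml")  # the "lxml" parser requires the lxml package

for b in soup.find_all("div", class_="row d-3"):
    # The strings alternate between label and value
    s = list(b.stripped_strings)
    # Pair each label (even index) with the value that follows it (odd index)
    print(dict(zip(s[::2], s[1::2])))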

Output:

{'Category': 'Clothing', 'Sub-category': 'this is Sub-category', 'product': 'This is the actual product'}
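Pairing a flat list assumes that every label has a value. If one column is empty, all later pairs shift by one. A possible variation, sketched here under assumptions that are not part of the original answer, handles each inner div separately: the markup is modified so that Sub-category has no value, the built-in "html.parser" is used in place of lxml, and the CSS selector is my own choice.

```python
from bs4 import BeautifulSoup

# Same structure as the question, but the Sub-category value is left empty
html = """
<div class="row d-3">
    <div class="col-16 col-sm-8">
        <strong>Category</strong> <br>
        Clothing</div>
    <div class="col-16 col-sm-8">
        <strong>Sub-category</strong> <br>
    </div>
    <div class="col-16 col-sm-8">
        <strong>product</strong> <br>
        This is the actual product </div>
</div>"""

soup = BeautifulSoup(html, "html.parser")

result = {}
# Visit each column div that is a direct child of the row
for col in soup.select("div.row.d-3 > div"):
    strings = list(col.stripped_strings)
    if not strings:
        continue  # skip columns with no text at all
    # First string is the label; anything after it is the value
    label, *rest = strings
    result[label] = " ".join(rest) or None

print(result)
```

Here the missing value shows up as None under its own label instead of shifting the later pairs.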

Solution 1
0 2022-01-09 10:12:25
