BeautifulSoup: how to extract text that is not wrapped in its own tag from an HTML file?
I have content similar to the HTML below, and I am trying to parse/extract some text from it.
<div class="row d-3">
<div class="col-16 col-sm-8">
<strong>Category</strong> <br>
// *** extract this text ***
Clothing</div>
<div class="col-16 col-sm-8">
<strong>Sub-category</strong> <br>
// *** extract this text ***
this is Sub-category
</div>
<div class="col-16 col-sm-8">
<strong>product</strong> <br>
// *** extract this text ***
This is the actual product </div>
</div>
I need the following:
{Category: Clothing, Sub-category: this is Sub-category, product: This is the actual product}
I tried the following:
for b in soup.find_all("div", class_="row d-3"):
    print(b.strong.get_text())
But this only extracts Category (the text of the first <strong> tag), not Clothing.
You can use contents, or the following solution based on stripped_strings, which is a generator:
list(b.stripped_strings)
#Output --> ['Category', 'Clothing', 'Sub-category', 'this is Sub-category', 'product', 'This is the actual product']
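The contents route mentioned above could look roughly like this. It is only a sketch: it assumes the built-in "html.parser" backend and works on a single inner div, keeping the direct children that are bare text nodes and dropping the whitespace-only ones.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

html = """
<div class="col-16 col-sm-8">
<strong>Category</strong> <br>
Clothing</div>
"""

soup = BeautifulSoup(html, "html.parser")
div = soup.find("div", class_="col-16")

# .contents lists the direct children of the div: tags and bare text alike.
# Keep only the text nodes that still hold something after stripping.
text_nodes = [
    str(child).strip()
    for child in div.contents
    if isinstance(child, NavigableString) and str(child).strip()
]
print(text_nodes)
```

The label inside <strong> is skipped here because it is a tag, not a direct text child of the div.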
To convert this result into a dict, you can use:
dict(zip(s[::2], s[1::2]))  # s = list(b.stripped_strings)
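To see why the slicing works, here is a small standalone sketch on a plain list (the names keys, values and result are only illustrative). The even positions hold the labels and the odd positions hold the values, so zipping the two slices pairs them up.

```python
# The flat list as produced by stripped_strings: label, value, label, value, ...
s = [
    "Category", "Clothing",
    "Sub-category", "this is Sub-category",
    "product", "This is the actual product",
]

keys = s[::2]     # items at index 0, 2, 4: the labels
values = s[1::2]  # items at index 1, 3, 5: the texts to extract

result = dict(zip(keys, values))
print(result)
```

Note that this pairing assumes every label is followed by exactly one string; a missing value would shift all later pairs.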
from bs4 import BeautifulSoup

html = '''
<div class="row d-3">
<div class="col-16 col-sm-8">
<strong>Category</strong> <br>
Clothing</div>
<div class="col-16 col-sm-8">
<strong>Sub-category</strong> <br>
this is Sub-category
</div>
<div class="col-16 col-sm-8">
<strong>product</strong> <br>
This is the actual product </div>
</div>'''
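soup = BeautifulSoup(html, "lxml")  # requires the lxml package
for b in soup.find_all("div", class_="row d-3"):
    # Strings alternate: label, value, label, value, ...
    s = list(b.stripped_strings)
    # Pair even-indexed labels with odd-indexed values
    print(dict(zip(s[::2], s[1::2])))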
{'Category': 'Clothing', 'Sub-category': 'this is Sub-category', 'product': 'This is the actual product'}
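If some cells might lack a value, pairing by position can drift out of step. A possible variant, sketched here with the built-in "html.parser" backend and illustrative names (row, cell, result), builds the dict one inner div at a time, taking the <strong> text as the key and the remaining strings as the value.

```python
from bs4 import BeautifulSoup

html = """
<div class="row d-3">
<div class="col-16 col-sm-8">
<strong>Category</strong> <br>
Clothing</div>
<div class="col-16 col-sm-8">
<strong>Sub-category</strong> <br>
this is Sub-category
</div>
<div class="col-16 col-sm-8">
<strong>product</strong> <br>
This is the actual product </div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
row = soup.find("div", class_="row d-3")

result = {}
# recursive=False restricts the search to the row's direct child divs.
for cell in row.find_all("div", recursive=False):
    if cell.strong is None:
        continue  # no label in this cell, nothing to key on
    key = cell.strong.get_text(strip=True)
    # The first stripped string is the label itself; the rest is the value.
    strings = list(cell.stripped_strings)
    result[key] = " ".join(strings[1:])

print(result)
```

A cell with a label but no text simply maps to an empty string instead of shifting the later pairs.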