I want to extract 1626 from the markup below using Python and Beautiful Soup. I tried the answer to "Accessing untagged text using beautifulsoup", but all I get back is an empty list [].
<div class="columns">
<h1 style="line-height: .85em; margin-top: 0" class="panel-border text-primary strong">
Laundry Dry Cleaning Equipment
<br>
<br>
</h1>
1626 Total Items
<!-- br-->
<div>...</div>
</div>
How can I extract the number?
You can loop through the lines of the HTML and find what you need with a regex:
import bs4, re
page = """
<div class="columns">
<h1 style="line-height: .85em; margin-top: 0" class="panel-border text-primary strong">
Laundry Dry Cleaning Equipment
<br>
<br>
</h1>
1626 Total Items
5526 Total Items
4426 Total Items
<!-- br-->
<div>...</div>
</div>"""
soup = bs4.BeautifulSoup(page, 'lxml')
divs = soup.find_all('div', {'class': 'columns'})
div = divs[0]  # we only have one div
divtext = str(div).split('\n')  # get the div's HTML and split it into lines
for line in divtext:
    line = line.strip()
    # match the wanted pattern
    match = re.match(r'^(\d+)\s*Total Items$', line)
    if match is not None:  # a match was found
        print(match.group(1))  # extract the number
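If the goal is only to collect the numbers, the same pattern can also be applied to the raw HTML string in one pass, without splitting it into lines first. The following is a minimal sketch under that assumption; the names `TOTAL_RE` and `find_totals` are made up for illustration and are not part of any library.

```python
import re

page = """
<div class="columns">
<h1 class="panel-border text-primary strong">
Laundry Dry Cleaning Equipment
<br>
<br>
</h1>
1626 Total Items
5526 Total Items
4426 Total Items
<!-- br-->
<div>...</div>
</div>"""

# re.MULTILINE makes ^ and $ match at the start and end of every line,
# so each "<number> Total Items" line is matched on its own.
TOTAL_RE = re.compile(r'^[ \t]*(\d+)[ \t]+Total Items[ \t]*$', re.MULTILINE)


def find_totals(html):
    """Return every number that sits alone on a '<n> Total Items' line."""
    return [int(n) for n in TOTAL_RE.findall(html)]


print(find_totals(page))
```

This skips the HTML parser entirely, so it only works when the text really does sit on a line of its own, as it does in the sample above.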
I tried to follow the same conventions as the answer you linked in your question.
Hopefully this is what you are looking for.
Code:
data = """
<div class="columns">
<h1 style="line-height: .85em; margin-top: 0" class="panel-border text-primary strong">
Laundry Dry Cleaning Equipment
<br>
<br>
</h1>
1626 Total Items
<!-- br-->
<div>...</div>
</div>
"""
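from bs4 import BeautifulSoup

soup = BeautifulSoup(data, 'html.parser')
# walk every text node in the document and keep the one we want
for i in soup.find_all(string=True):
    if "Total Items" in i:
        # drop the label, then strip the surrounding whitespace and newlines
        print(str(i).replace(' ', '').replace('TotalItems', '').strip())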
Output:
1626
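Since the number sits in a bare text node directly after the `<h1>`, another option is to start from the heading and walk its following siblings instead of scanning every text node. The sketch below assumes the page always has a `div` with class `columns` containing an `h1`; `extract_total` is a hypothetical helper name, not something from Beautiful Soup.

```python
import re
from bs4 import BeautifulSoup, Comment, NavigableString

html = """
<div class="columns">
<h1 class="panel-border text-primary strong">
Laundry Dry Cleaning Equipment
<br>
<br>
</h1>
1626 Total Items
<!-- br-->
<div>...</div>
</div>
"""


def extract_total(markup):
    """Return the number from the '<n> Total Items' text after the <h1>, or None."""
    soup = BeautifulSoup(markup, "html.parser")
    # Assumption: the div and the h1 both exist in the markup.
    heading = soup.find("div", class_="columns").find("h1")
    for sibling in heading.next_siblings:
        # Only look at plain text nodes; skip tags and HTML comments.
        if isinstance(sibling, NavigableString) and not isinstance(sibling, Comment):
            match = re.search(r"(\d+)\s*Total Items", str(sibling))
            if match:
                return int(match.group(1))
    return None


print(extract_total(html))
```

Anchoring on the heading keeps the search local, which may help if "Total Items" also appears elsewhere on the real page.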