
Untagged text extraction with Python is not working

I want to extract 1626 from the markup below using Python and Beautiful Soup. I have tried the answer in "Accessing untagged text using beautifulsoup", but all I get back is an empty list, [].

<div class="columns">
<h1 style="line-height: .85em; margin-top: 0" class="panel-border text-primary strong">
            Laundry Dry Cleaning Equipment
            <br>

            <br>
</h1>

        1626 Total Items
<!-- br-->
<div>...</div>
</div>

How can I extract the number?

You can loop through the lines of the HTML and pick out what you need with a regex:

import bs4, re

page = """
<div class="columns">
<h1 style="line-height: .85em; margin-top: 0" class="panel-border text-primary strong">
            Laundry Dry Cleaning Equipment
            <br>

            <br>
</h1>

        1626 Total Items
    5526 Total Items
                    4426 Total Items
<!-- br-->
<div>...</div>
</div>"""

soup = bs4.BeautifulSoup(page, 'lxml')

divs = soup.find_all('div', {'class': 'columns'})
div = divs[0]    # only one div has the class "columns"

divtext = str(div).split('\n')   # get the div's HTML and split it into lines
for line in divtext:
    line = line.strip()

    # match wanted pattern
    match = re.match(r'^(\d+)\s*Total Items$', line)

    if match is not None:     #if match found
        print(match.group(1)) # extract the number
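
If the numbers are needed as values rather than printed, the same pattern can be wrapped in a small function. This is only a sketch: `total_item_counts` and `TOTAL_RE` are hypothetical names, and the regex is applied to the raw markup string with `re.MULTILINE` instead of splitting the lines by hand. It assumes each count sits on its own line, as in the sample above.

```python
import re

# Hypothetical helper, not part of the original answer.
# Matches "<number> Total Items" when it is the only content on a line.
TOTAL_RE = re.compile(r"^[ \t]*(\d+)\s*Total Items[ \t]*$", re.MULTILINE)

def total_item_counts(markup):
    """Return every 'N Total Items' count in the markup as a list of ints."""
    return [int(n) for n in TOTAL_RE.findall(markup)]

sample = """<div class="columns">
<h1>Laundry Dry Cleaning Equipment</h1>

        1626 Total Items
    5526 Total Items
<!-- br-->
</div>"""

print(total_item_counts(sample))
```

Because this never parses the HTML, it would also match the phrase inside a script or comment, so it is best kept for markup you have already narrowed down.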

I tried to follow the same conventions as the answer you linked in your question.

Hopefully this is what you are looking for.

Code:

from bs4 import BeautifulSoup

data = """
<div class="columns">
<h1 style="line-height: .85em; margin-top: 0" class="panel-border text-primary strong">
            Laundry Dry Cleaning Equipment
            <br>

            <br>
</h1>

        1626 Total Items
<!-- br-->
<div>...</div>
</div>
"""
soup = BeautifulSoup(data, 'html.parser')
for i in soup.find_all(string=True):   # iterate over every text node
    if "Total Items" in i:
        # strip surrounding whitespace and newlines, then drop the label
        print(i.strip().replace('Total Items', '').strip())

Output:

1626
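
Another option, closer to the sibling navigation idea in the linked question: in this markup the count is the text node that comes directly after the closing </h1>, so it can be reached with `next_sibling`. The sketch below assumes the `html.parser` backend and that nothing sits between the heading and the text; `total_items` is a hypothetical helper name.

```python
import re
from bs4 import BeautifulSoup

def total_items(markup):
    """Return the 'N Total Items' count that follows the <h1>, or None."""
    soup = BeautifulSoup(markup, "html.parser")
    div = soup.find("div", class_="columns")
    heading = div.find("h1") if div is not None else None
    if heading is None:
        return None
    # The untagged text is a NavigableString sibling of the <h1>.
    node = heading.next_sibling
    match = re.search(r"(\d+)\s*Total Items", str(node))
    return int(match.group(1)) if match else None

data = """
<div class="columns">
<h1 class="panel-border text-primary strong">
            Laundry Dry Cleaning Equipment
            <br>
</h1>

        1626 Total Items
<!-- br-->
<div>...</div>
</div>
"""

print(total_items(data))
```

This ties the lookup to the heading instead of scanning every text node, which helps if the phrase "Total Items" could appear elsewhere on the page.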

