
How to find links with all uppercase text using Python (without a 3rd party parser)?

I am using BeautifulSoup in a simple function to extract links that have all uppercase text:

import BeautifulSoup

def findAllCapsUrls(page_contents):
    """ given HTML, returns a list of URLs that have ALL CAPS text
    """
    soup = BeautifulSoup.BeautifulSoup(page_contents)
    all_urls = soup.findAll(name='a')

    # if the text for the link is ALL CAPS then add the link to good_urls
    good_urls = []
    for url in all_urls:
        text = url.find(text=True)
        if text and text.upper() == text:  # skip anchors with no text (e.g. image-only links)
            good_urls.append(url['href'])

    return good_urls

This works well most of the time, but a handful of pages will not parse correctly in BeautifulSoup (or lxml, which I also tried) due to malformed HTML, resulting in an object with no (or only some) links in it. A "handful" might sound like not a big deal, but this function is being used in a crawler, so there could be hundreds of pages that the crawler will never find...

How can the above function be refactored to not use a parser like BeautifulSoup? I've searched around for how to do this with regex, but every answer says "use BeautifulSoup." Alternatively, I started looking at how to "fix" the malformed HTML so that it parses, but I don't think that is the best route...

What is an alternative solution, using re or something else, that can do the same as the function above?

If the HTML pages are malformed, there are not many solutions that can really help you. BeautifulSoup or another parsing library is the way to go for parsing HTML files.

If you want to avoid the library route, you could use a regexp to match all your links (see regular-expression-to-extract-url-from-an-html-link), using a character range like [A-Z]; a sketch follows below.
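A minimal sketch of that regex-only idea (not code from the answer itself; the pattern and the helper name are my own, and it assumes double-quoted href attributes and anchor text without nested tags):

import re

# Assumes double-quoted hrefs and anchor text with no nested tags.
ANCHOR_RE = re.compile(r'<a\s[^>]*href\s*=\s*"([^"]*)"[^>]*>([^<]*)</a>', re.IGNORECASE)

def find_all_caps_urls_re(page_contents):
    """Return hrefs whose anchor text contains no lowercase letters."""
    good_urls = []
    for href, text in ANCHOR_RE.findall(page_contents):
        if text.strip() and not re.search(r'[a-z]', text):
            good_urls.append(href)
    return good_urls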

When I need to parse really broken HTML and speed is not the most important factor, I automate a browser with Selenium and WebDriver.

This is the most resilient way of parsing HTML that I know of. Check this tutorial; it shows how to extract Google suggestions using WebDriver (the code is in Java, but it can be adapted to Python).
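For completeness, here is a rough Python sketch of that idea using the current Selenium bindings (the function name and the choice of Firefox are placeholders, not part of the answer):

from selenium import webdriver
from selenium.webdriver.common.by import By

def find_all_caps_urls_browser(url):
    """Let a real browser repair the broken HTML, then read links from the DOM."""
    driver = webdriver.Firefox()  # assumes a local Firefox/geckodriver install
    try:
        driver.get(url)
        good_urls = []
        for a in driver.find_elements(By.TAG_NAME, "a"):
            text = a.text
            if text and text.upper() == text:
                good_urls.append(a.get_attribute("href"))
        return good_urls
    finally:
        driver.quit()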

I ended up with a combination of regex and BeautifulSoup:

import re
import BeautifulSoup

def findAllCapsUrls2(page_contents):
    """ returns a list of URLs that have ALL CAPS text, given
    the HTML from a page. Uses a combo of RE and BeautifulSoup
    to handle malformed pages.
    """
    # get all anchors on page using regex
    p = r'<a\s+href\s*=\s*"([^"]*)"[^>]*>(.*?(?=</a>))</a>'
    re_urls = re.compile(p, re.DOTALL)
    all_a = re_urls.findall(page_contents)

    # if the text for the anchor is ALL CAPS then add the link to good_urls
    good_urls = []
    for a in all_a:
        href = a[0]
        a_content = a[1]
        a_soup = BeautifulSoup.BeautifulSoup(a_content)
        text = ''.join([s.strip() for s in a_soup.findAll(text=True) if s])
        if text and text.upper() == text:
            good_urls.append(href)

    return good_urls

This is working for my use cases so far, but I wouldn't guarantee it to work on all pages. Also, I only use this function if the original one fails.
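In case it helps, the fallback can be wrapped in a small helper (a hypothetical name; it assumes "fails" means the parser-based version raises or returns no links):

def find_all_caps_urls_with_fallback(page_contents):
    """Try the BeautifulSoup-based function first; fall back to the regex version."""
    try:
        urls = findAllCapsUrls(page_contents)
    except Exception:
        urls = []
    return urls if urls else findAllCapsUrls2(page_contents)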
