How can I find all links with ALL CAPS text using Python (without a third-party parser)?

Question

I'm using BeautifulSoup in a simple function to extract the links whose text is ALL CAPS:

import BeautifulSoup  # BeautifulSoup 3, matching the call below

def findAllCapsUrls(page_contents):
    """ given HTML, returns a list of URLs that have ALL CAPS text
    """
    soup = BeautifulSoup.BeautifulSoup(page_contents)
    all_urls = soup.findAll(name='a')

    # if the text for the link is ALL CAPS then add the link to good_urls
    good_urls = []
    for url in all_urls:
        text = url.find(text=True)
        # find() returns None for an anchor with no text node, so guard against it
        if text and text.upper() == text:
            good_urls.append(url['href'])

    return good_urls

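This works most of the time, but a handful of pages fail to parse correctly in BeautifulSoup (or in lxml, which I also tried) because of malformed HTML on the page, leaving an object with no links (or only some of them). "A handful" may not sound like a big deal, but this function is used in a crawler, so there could be hundreds of pages that the crawler will never find...

How can the function above be refactored so that it does not use a parser like BeautifulSoup? I have been searching for how to do this with regular expressions, but all the answers say "use BeautifulSoup". I also started looking at how to "fix" the malformed HTML so that it parses, but I don't think that is the best route.

What is an alternative solution, using re or something else, that does the same as the function above?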

Answer 1

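If the HTML pages are malformed, there are not many solutions that can really help you. BeautifulSoup or another parsing library is the way to go for parsing HTML files.

If you want to avoid the library route, you could use a regular expression to match all the links (see the question "Regular expression to extract URL from an HTML link") and use an [A-Z] range for the link text.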
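
As a rough illustration of that regex-only route, here is a minimal sketch that uses only the standard library (Python 3). The function name `find_all_caps_urls_regex` and both patterns are my own assumptions, not taken from the linked question. Instead of putting an [A-Z] range in the pattern, it strips nested tags from the link text and compares the result with its upper-cased form, as the function in the question does.

```python
import re
from html import unescape

# Hypothetical pattern: an <a> tag with a single- or double-quoted href,
# followed by everything up to the next closing </a>.
ANCHOR_RE = re.compile(
    r'<a\b[^>]*?\bhref\s*=\s*(["\'])(.*?)\1[^>]*>(.*?)</a\s*>',
    re.IGNORECASE | re.DOTALL,
)
# Crude tag stripper for markup nested inside the link text.
TAG_RE = re.compile(r'<[^>]+>')


def find_all_caps_urls_regex(page_contents):
    """Return the hrefs of links whose visible text is ALL CAPS."""
    good_urls = []
    for _quote, href, inner in ANCHOR_RE.findall(page_contents):
        # Drop nested tags, decode entities, trim surrounding whitespace.
        text = unescape(TAG_RE.sub('', inner)).strip()
        if text and text == text.upper():
            good_urls.append(href)
    return good_urls
```

This will still miss or mis-pair some links, for example anchors with unquoted href values or anchors that are never closed, so treat it as a starting point rather than a replacement for a parser.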

Answer 2

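When I need to parse really broken HTML and speed is not the most important factor, I automate a browser with selenium and webdriver.

This is the toughest (most fault-tolerant) way of parsing HTML that I know of. Check this tutorial, which shows how to extract Google suggestions using WebDriver (the code is in Java, but it can be changed to Python).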

Answer 3

I ended up with a combination of a regular expression and BeautifulSoup:

import re
import BeautifulSoup  # BeautifulSoup 3, matching the call below

def findAllCapsUrls2(page_contents):
    """ returns a list of URLs that have ALL CAPS text, given
    the HTML from a page. Uses a combo of RE and BeautifulSoup
    to handle malformed pages.
    """
    # get all anchors on page using regex
    p = r'<a\s+href\s*=\s*"([^"]*)"[^>]*>(.*?(?=</a>))</a>'
    re_urls = re.compile(p, re.DOTALL)
    all_a = re_urls.findall(page_contents)

    # if the text for the anchor is ALL CAPS then add the link to good_urls
    good_urls = []
    for a in all_a:
        href = a[0]
        a_content = a[1]
        a_soup = BeautifulSoup.BeautifulSoup(a_content)
        text = ''.join([s.strip() for s in a_soup.findAll(text=True) if s])
        if text and text.upper() == text:
            good_urls.append(href)

    return good_urls

So far this works for my use cases, but I can't guarantee that it works on all pages. Also, I only use this function when the original one fails.
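
For readers who want to stay entirely within the standard library, as the question title asks, one possible variation is to use Python 3's `html.parser.HTMLParser`, which tends to keep going on malformed markup instead of giving up. This is only a sketch and not what the answer above does: the class name `AllCapsLinkParser`, the function name `find_all_caps_urls_stdlib`, and the rule that an unclosed `<a>` ends where the next `<a>` begins are all assumptions of mine.

```python
from html.parser import HTMLParser


class AllCapsLinkParser(HTMLParser):
    """Collects hrefs of anchors whose text is ALL CAPS (hypothetical helper)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.good_urls = []
        self._href = None    # href of the anchor currently open, if any
        self._chunks = []    # text pieces seen inside that anchor

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            # Assumption: an unclosed <a> ends where the next <a> begins.
            self.finish_anchor()
            self._href = dict(attrs).get('href')
            self._chunks = []

    def handle_endtag(self, tag):
        if tag == 'a':
            self.finish_anchor()

    def handle_data(self, data):
        if self._href is not None:
            self._chunks.append(data)

    def finish_anchor(self):
        if self._href is not None:
            text = ''.join(self._chunks).strip()
            if text and text == text.upper():
                self.good_urls.append(self._href)
        self._href = None
        self._chunks = []


def find_all_caps_urls_stdlib(page_contents):
    parser = AllCapsLinkParser()
    parser.feed(page_contents)
    parser.close()           # flush any buffered text
    parser.finish_anchor()   # handle a final anchor that was never closed
    return parser.good_urls
```

Like the regex version, this could be used as a fallback when the BeautifulSoup-based function returns nothing. How well it copes with a given broken page would need to be checked against real samples.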

3 answers

Answer 1: score 3, 2010-11-04 13:48:46
Answer 2: score 1, 2010-11-04 13:44:03
Answer 3: score 0, accepted, 2010-11-04 14:50:37
