Python - Regex outputs only the last occurrence [HTML Scraping]

Question

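I am scraping a locally archived copy of techcrunch.com. I am using a regular expression to pick out the title of every article, but my output is only ever the last occurrence.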

def extractNews():
    selection = listbox.curselection()

    if selection == (0,):
        # Read the webpage:
        response = urlopen("file:///E:/University/IFB104/InternetArchive/Archives/Sun,%20October%201st,%202017.html")
        html = response.read()

        match = findall((r'<h2 class="post-title"><a href="(.*?)".*>(.*)</a></h2>'), str(html)) # use [-2] for position after )

        if match:
            for link, title in match:
                variable = "%s" % (title)

        print(variable)

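The current output is

Heetch raises $12 million to reboot its ridesharing service

which is the last headline on the whole page (the last occurrence).

The page markup looks like this, and every article block contains the same headline code: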

<h2 class="post-title"><a href="https://web.archive.org/web/20171001000310/https://techcrunch.com/2017/09/29/heetch-raises-12-million-to-reboot-its-ride-sharing-service/" data-omni-sm="gbl_river_headline,20">Heetch raises $12 million to reboot its ridesharing service</a></h2>

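I don't understand why it results in only the last match. I ran it through sites such as https://regex101.com/, which tells me I have only one match, and that is not the match my program outputs. Any help would be greatly appreciated.

Edit: If anyone knows a way to display each match separately between different <h1></h1> tags when writing to an .html file, that would mean a lot :) I'm not sure whether this is right, but I think you also use [-#] for the position/match?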

Answer 1

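The regular expression is fine; your problem is in the loop here.

if match:
    for link, title in match:
        variable = "%s" % (title)

Each iteration overwrites your variable. That is why you only see the value from the last iteration of the loop.

You could do something along these lines: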

# Create the list first so it exists even when nothing matches
variableList = []
if match:
    for link, title in match:
        variable = "%s" % (title)
        variableList.append(variable)

print(variableList)
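
To make the idea concrete, here is a minimal, self-contained sketch that collects every title into a list and then wraps each one in its own <h1> element, as asked in the question's edit. The sample markup and the names `SAMPLE_HTML`, `extract_titles`, and `render_headings` are made up for illustration. Note one deliberate change from the question's pattern: the quantifiers here are non-greedy (`.*?`), because a greedy `.*` can swallow several headlines into a single match when the whole page ends up on one line.

```python
import re

# Hypothetical sample markup standing in for the archived page (one line on purpose).
SAMPLE_HTML = (
    '<h2 class="post-title"><a href="https://example.com/a" data-omni-sm="x,1">First headline</a></h2>'
    '<h2 class="post-title"><a href="https://example.com/b" data-omni-sm="x,2">Second headline</a></h2>'
)

# Non-greedy quantifiers stop at the first closing tag instead of the last one.
PATTERN = re.compile(r'<h2 class="post-title"><a href="(.*?)".*?>(.*?)</a></h2>')


def extract_titles(html):
    """Return every headline found in html, in page order."""
    return [title for link, title in PATTERN.findall(html)]


def render_headings(titles):
    """Wrap each title in its own <h1> element, one per line."""
    return "\n".join("<h1>%s</h1>" % title for title in titles)


titles = extract_titles(SAMPLE_HTML)
print(titles)
print(render_headings(titles))
```

Because the titles now live in a list, negative indexes address them from the end: `titles[-1]` is the last headline and `titles[-2]` the one before it. The string returned by `render_headings` could then be written to an .html file with an ordinary `open(..., "w")` call.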

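Also, in general, I would advise against using regular expressions to parse HTML (per the famous answer).

If you are not yet familiar with BeautifulSoup, you should get to know it. Here is a non-regex solution that uses BeautifulSoup to pull all the h2 post-title elements out of your HTML page.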

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, "html.parser")
# Each result is a Tag; get_text() returns the headline text
for h2 in soup.find_all('h2', {'class': 'post-title'}):
    print(h2.get_text())
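
If installing BeautifulSoup is not an option, roughly the same result can be sketched with the standard library's `html.parser` module. This is a swapped-in technique rather than the BeautifulSoup approach itself, and the class name `PostTitleParser` and the sample markup are assumptions made up for this example.

```python
from html.parser import HTMLParser


class PostTitleParser(HTMLParser):
    """Collect (link, title) pairs from <h2 class="post-title"> blocks."""

    def __init__(self):
        super().__init__()
        self.in_title = False  # True while inside a post-title <h2>
        self.link = None       # href of the <a> inside the current <h2>
        self.text = []         # text fragments of the current headline
        self.results = []      # finished (link, title) pairs

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "h2" and "post-title" in (attrs.get("class") or "").split():
            self.in_title = True
            self.link = None
            self.text = []
        elif tag == "a" and self.in_title:
            self.link = attrs.get("href")

    def handle_data(self, data):
        if self.in_title:
            self.text.append(data)

    def handle_endtag(self, tag):
        if tag == "h2" and self.in_title:
            self.results.append((self.link, "".join(self.text).strip()))
            self.in_title = False


# Hypothetical sample markup standing in for the archived page.
SAMPLE_HTML = (
    '<h2 class="post-title"><a href="https://example.com/a">First headline</a></h2>\n'
    '<h2 class="post-title"><a href="https://example.com/b">Second headline</a></h2>'
)

parser = PostTitleParser()
parser.feed(SAMPLE_HTML)
parser.close()

for link, title in parser.results:
    print(title, link)
```

The parser tracks whether it is inside a matching <h2>, so text elsewhere on the page is ignored and each headline is stored together with its link.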

Answer 1: accepted, 2017-10-12 10:37:53
