
Python: Regular expression outputting only the last occurrence [HTML scraping]

I'm web scraping a local archive of techcrunch.com. I'm using a regex to find and grab the heading of every article, but my output is always just the last occurrence.

def extractNews():
    selection = listbox.curselection()

    if selection == (0,):
        # Read the webpage:
        response = urlopen("file:///E:/University/IFB104/InternetArchive/Archives/Sun,%20October%201st,%202017.html")
        html = response.read()

        match = findall((r'<h2 class="post-title"><a href="(.*?)".*>(.*)</a></h2>'), str(html)) # use [-2] for position after )

        if match:
            for link, title in match:
                variable = "%s" % (title)

        print(variable)

The current output is:

Heetch raises $12 million to reboot its ridesharing service

This is the last heading on the entire webpage (the last occurrence).

Each article block on the page uses the same markup for its heading:

<h2 class="post-title"><a href="https://web.archive.org/web/20171001000310/https://techcrunch.com/2017/09/29/heetch-raises-12-million-to-reboot-its-ride-sharing-service/" data-omni-sm="gbl_river_headline,20">Heetch raises $12 million to reboot its ridesharing service</a></h2>

I cannot see why it keeps returning this last match. I have run the pattern through sites such as https://regex101.com/, which tells me that I have only one match, and it is not the one my program outputs. Any help would be greatly appreciated.

EDIT: If anyone knows a way to display each matched result SEPARATELY, each within its own <h1></h1> tags, when writing to a .html file, it would mean a lot :) I am not sure if this is right, but I think you use [-#] for the position/match being referred to?

The regex is fine, but your problem is in the loop here.

if match:
    for link, title in match:
        variable = "%s" % (title)

Your variable is overwritten in each iteration. That's why you only see its value from the last iteration of the loop.

You could do something along these lines:

variableList = []  # created before the loop so every title is kept
if match:
    for link, title in match:
        variable = "%s" % (title)
        variableList.append(variable)

print(variableList)
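
Separately from the overwriting issue, it may be worth checking how many matches the pattern actually returns. If the page ends up as one long line (which can happen when str() is applied to the bytes returned by read()), the greedy `.*` in the question's pattern can stretch from the first heading to the last one and yield a single match. The following is only a minimal sketch under that assumption: SAMPLE_HTML, PATTERN and extract_titles are hypothetical names, and the sample markup merely imitates the heading shown in the question.

```python
import re

# Hypothetical sample standing in for the archived page:
# two headings on a single line, as in the question's markup.
SAMPLE_HTML = (
    '<h2 class="post-title"><a href="https://example.com/a" data-omni-sm="x,1">First title</a></h2>'
    '<h2 class="post-title"><a href="https://example.com/b" data-omni-sm="x,2">Second title</a></h2>'
)

# Non-greedy groups and [^>]* keep each match inside a single <h2> block.
PATTERN = re.compile(r'<h2 class="post-title"><a href="(.*?)"[^>]*>(.*?)</a></h2>')


def extract_titles(html_text):
    """Return every heading title, in document order."""
    return [title for _link, title in PATTERN.findall(html_text)]


titles = extract_titles(SAMPLE_HTML)
for title in titles:
    print(title)
```

On this sample, the non-greedy pattern returns one (link, title) pair per heading, whereas the original greedy pattern pairs the first link with the last title.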

Also, generally, I would recommend against using regex to parse HTML (as per the famous answer).

If you haven't already familiarised yourself with BeautifulSoup, you should. Here is a non-regex solution using BeautifulSoup to dig out all h2 post-titles from your html page.

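from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "html.parser")
# find_all is the current name for findAll; each result is an <h2> tag
for heading in soup.find_all('h2', {'class': 'post-title'}):
    print(heading.get_text(strip=True))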
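
As for the EDIT in the question, one possible way to put each title between its own <h1></h1> tags is to build the page from the list of titles and write it out in one go. This is a rough sketch, not the only approach: build_page, the sample titles and the output filename headlines.html are all assumptions made for illustration. Negative indexes such as [-1] or [-2] only pick a single element counted from the end of a list, so they are not needed to emit every match.

```python
from html import escape


def build_page(titles):
    """Wrap each title in its own <h1> element and return a complete HTML document."""
    headings = "\n".join("<h1>%s</h1>" % escape(title) for title in titles)
    return "<!DOCTYPE html>\n<html>\n<body>\n%s\n</body>\n</html>\n" % headings


# Hypothetical titles; in the question these would come from the collected matches.
titles = ["First title", "Heetch raises $12 million to reboot its ridesharing service"]
page = build_page(titles)

# "headlines.html" is an assumed output filename.
with open("headlines.html", "w", encoding="utf-8") as f:
    f.write(page)
```

escape() is applied so that characters such as & or < inside a title cannot break the generated markup.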
