
Python: Regular expression outputting only the last occurrence [HTML Scraping]

I'm web scraping from a local archive of techcrunch.com. I'm using a regex to grab the heading of every article, but my output is always just the last occurrence.

# Imports assumed from the calls used below (Python 3)
from re import findall
from urllib.request import urlopen

def extractNews():
    # listbox is a Tkinter Listbox created elsewhere in the program
    selection = listbox.curselection()

    if selection == (0,):
        # Read the webpage:
        response = urlopen("file:///E:/University/IFB104/InternetArchive/Archives/Sun,%20October%201st,%202017.html")
        html = response.read()

        match = findall((r'<h2 class="post-title"><a href="(.*?)".*>(.*)</a></h2>'), str(html)) # use [-2] for position after )

        if match:
            for link, title in match:
                variable = "%s" % (title)

        print(variable)

The current output is:

Heetch raises $12 million to reboot its ridesharing service

which is the last heading on the entire page (the last occurrence).

Each article block on the page uses the same markup for its heading:

<h2 class="post-title"><a href="https://web.archive.org/web/20171001000310/https://techcrunch.com/2017/09/29/heetch-raises-12-million-to-reboot-its-ride-sharing-service/" data-omni-sm="gbl_river_headline,20">Heetch raises $12 million to reboot its ridesharing service</a></h2>

I cannot see why it keeps returning this last match. I have run it through sites such as https://regex101.com/, which tells me there is only one match, and it is not the one my program outputs. Any help would be greatly appreciated.

EDIT: If anyone knows a way to display each matched result separately, each between its own <h1></h1> tags, when writing to an .html file, it would mean a lot :) I am not sure if this is right, but I think you use [-#] to refer to a particular position/match?

The regex is fine, but your problem is in this loop:

if match:
    for link, title in match:
        variable = "%s" % (title)

Your variable is overwritten on each iteration. That's why you only see its value from the last iteration of the loop.

You could do something along these lines:

# Create the list before the check so the print below works even with no matches
variableList = []
if match:
    for link, title in match:
        variable = "%s" % (title)
        variableList.append(variable)

print(variableList)
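
As a self-contained illustration of collecting every title, here is a minimal sketch. The sample markup and the example.com URLs are made up for the demonstration, modelled on the heading shown in the question. The pattern uses non-greedy `.*?` throughout, which is a small change from the original pattern: it is an assumption on my part that the page may arrive as one long line, in which case a greedy `.*` could run across several headings.

```python
import re

# Hypothetical sample markup modelled on the heading in the question.
# Both headings sit on one line, with no newline between them.
sample_html = (
    '<h2 class="post-title"><a href="https://example.com/a" data-omni-sm="x">First headline</a></h2>'
    '<h2 class="post-title"><a href="https://example.com/b" data-omni-sm="y">Second headline</a></h2>'
)

# Non-greedy groups stop at the first possible closing delimiter,
# so each heading produces its own (link, title) tuple.
pattern = re.compile(r'<h2 class="post-title"><a href="(.*?)".*?>(.*?)</a></h2>')
matches = pattern.findall(sample_html)

# Keep every title instead of overwriting a single variable.
titles = [title for link, title in matches]
print(titles)  # ['First headline', 'Second headline']
```

Because `titles` is an ordinary list, it can be indexed or looped over afterwards rather than holding only the value from the final iteration.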

Also, generally, I would recommend against using regex to parse HTML (as per the famous Stack Overflow answer on the subject).

If you haven't already familiarised yourself with BeautifulSoup, you should. Here is a non-regex solution using BeautifulSoup to extract all h2 post-titles from your HTML page.

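from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "html.parser")
# find_all is the current name for findAll; get_text() returns each heading's text
titles = [h2.get_text(strip=True) for h2 in soup.find_all('h2', {'class': 'post-title'})]
print(titles)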
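
Regarding the EDIT in the question: once the titles are in a list, each one can be wrapped in its own <h1></h1> pair before writing the file. The following is only a sketch using the standard library. The function name `titles_to_html`, the sample titles, and the output file name `headlines.html` are all hypothetical. As for the `[-#]` idea, negative indices do select from the end of a list (`titles[-1]` is the last title), but a loop or comprehension is what gives you every match.

```python
import html
import tempfile
from pathlib import Path

def titles_to_html(titles):
    """Return one <h1> element per title, each on its own line."""
    # html.escape keeps characters such as & and < from breaking the markup.
    return "\n".join("<h1>%s</h1>" % html.escape(title) for title in titles)

# Hypothetical titles standing in for the scraped results.
titles = ["First headline", "Fish & chips"]
page = titles_to_html(titles)

# Write to the system temp directory; substitute your own path as needed.
output_path = Path(tempfile.gettempdir()) / "headlines.html"
output_path.write_text(page, encoding="utf-8")
```

Building the whole string first and writing it once keeps the file handling separate from the formatting, so the same function also works on the list produced by the BeautifulSoup approach.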

