
RegEx for matching specific element of HTML

I am working on Python code that extracts specific elements from websites and prints them in a GUI implemented with the tkinter module. Extracting specific elements from a webpage requires regex, which I am new to. Although I can obtain various elements, I still find it difficult to extract certain ones. One such example is presented below.

<div class="updated published time-details"><a class="url" 
    href="https://thetriffid.com.au/gig/chocolate-starfish-one-last-kick/" 
    title="CHOCOLATE STARFISH (AUS) &#8220;ONE LAST KICK&#8221;" 
    rel="bookmark"><span class="tribe-event-date-start">Sat Aug 3 @ 8:00 
    pm</span>
    </a>
</div>

This is part of the HTML, from which I need only the title, i.e. "Chocolate Starfish (AUS) & One Last Kick". I am using the findall method, and we are not allowed to use an external library such as Beautiful Soup. So we have to work with findall, finditer, MULTILINE and DOTALL.

How do I get the desired outcome?

Using an HTML-aware solution like BeautifulSoup would handle more cases, but if you're sure the HTML will always conform to your example, you can use a rough regex match like:

re.findall('<a.*? title=\"(.*?)\"', html, re.DOTALL)
# ['CHOCOLATE STARFISH (AUS) &#8220;ONE LAST KICK&#8221;']
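
The captured title still contains HTML character references (&#8220; and &#8221;). If the decoded text is what you want, a minimal sketch is to pass each match through html.unescape, which is part of the standard library and so stays within the no-external-library constraint. The variable names below are illustrative, and the sample markup is copied from the question.

```python
import html
import re

# Sample markup copied from the question.
page = '''<div class="updated published time-details"><a class="url"
    href="https://thetriffid.com.au/gig/chocolate-starfish-one-last-kick/"
    title="CHOCOLATE STARFISH (AUS) &#8220;ONE LAST KICK&#8221;"
    rel="bookmark"><span class="tribe-event-date-start">Sat Aug 3 @ 8:00
    pm</span>
    </a>
</div>'''

# re.DOTALL lets '.' cross the line breaks between the tag's attributes.
raw_titles = re.findall(r'<a.*? title="(.*?)"', page, re.DOTALL)

# Decode numeric character references such as &#8220; into real characters.
titles = [html.unescape(t) for t in raw_titles]
print(titles)
```

The decoded result contains typographic quotation marks (U+201C and U+201D) around ONE LAST KICK rather than the raw character references.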

This regex finds 'a' tags that have a 'title' attribute. The attribute value is captured in group 2 (group 1 holds the quote character that delimits it).

As a string literal

r"(?si)<a(?=(?:[^>\"']|\"[^\"]*\"|'[^']*')*?\stitle\s*=\s*(['\"])(.*?)\1)(?:\".*?\"|'.*?'|[^>]*?)+>"

Readable version

 (?si)

 <a
 (?=
      (?: [^>"'] | " [^"]* " | ' [^']* ' )*?
      \s title \s* = \s* 
      ( ['"] )                      # (1)
      ( .*? )                       # (2)
      \1 
 )
 (?: " .*? " | ' .*? ' | [^>]*? )+
 >
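
A minimal sketch of how the pattern above might be used from Python with finditer, reading the title from group 2. The sample markup and the names TITLE_RE, sample, and titles are invented for illustration and are not from the original answer.

```python
import re

# The one-line form of the pattern above; (?si) enables DOTALL and IGNORECASE.
TITLE_RE = re.compile(
    r"""(?si)<a(?=(?:[^>"']|"[^"]*"|'[^']*')*?\stitle\s*=\s*(['"])(.*?)\1)(?:".*?"|'.*?'|[^>]*?)+>"""
)

# Hypothetical sample: a single-quoted title, an anchor with no title,
# and a title that appears before href.
sample = (
    '<a href="/a" title=\'First "quoted" gig\'>one</a>\n'
    '<A HREF="/b">no title</A>\n'
    '<a title="Second gig" href="/c">two</a>\n'
)

# Group 1 is the delimiting quote character; group 2 is the title text.
titles = [m.group(2) for m in TITLE_RE.finditer(sample)]
print(titles)
```

Because the lookahead handles either quote style and any attribute order, the anchor without a title is skipped and the other two are matched. Note that the pattern begins with a bare <a, so it would also match other tags whose names start with "a" (for example abbr) if they carry a title attribute.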

Benchmark using a large web page (cnn.com) and 300 iterations

Regex1:   (?si)<a(?=(?:[^>"']|"[^"]*"|'[^']*')*?\stitle\s*=\s*(['"])(.*?)\1)(?:".*?"|'.*?'|[^>]*?)+>
Options:  < none >
Completed iterations:   300  /  300     ( x 1 )
Matches found per iteration:   285
Elapsed Time:    3.26 s,   3262.08 ms,   3262081 µs
Matches per sec:   26,210
