RegEx for extracting part of HTML elements

I'm trying to pull events from this website, https://www.oldmuseum.org/, using a regexp tester. It works, but I also get events that are sold out.

This is the regular expression I'm trying to use.

summary-title-link">([^>]+(?!SOLD OUT))<

Produced output:

'An Evening with Sun Kil Moon'
'Amity Dry- Fortified'
'Teeny Tiny Stevies - SOLD OUT'
'Cine Retro '

I don't want to get the sold-out event, and I'm not sure how to fix this regular expression.

If only the SOLD OUT text is undesired, we could add an optional right boundary for it, something similar to:

 summary-title-link">(.+?)(?: - SOLD OUT)?<

The first capturing group, $1, is our desired title, followed by an optional " - SOLD OUT" suffix that is matched but not captured.

RegEx

If this expression isn't what you need, it can be modified or changed at regex101.com.

RegEx Circuit

jex.im can also help to visualize expressions.

Demo

const regex = /summary-title-link">(.+?)(- SOLD OUT)?</gm;
const str = `<a href="/event/bpo29sept" class="summary-title-link">Brisbane Philharmonic Orchestra - SOLD OUT</a>
<a href="/event/bpo29sept" class="summary-title-link">Brisbane Philharmonic Orchestra - SOLD OUT</a>
<a href="/event/bpo29sept" class="summary-title-link">Brisbane Philharmonic Orchestra - SOLD OUT</a>
<a href="/event/bpo29sept" class="summary-title-link">Brisbane Philharmonic Orchestra - (Some other data)</a>`;
let m;

while ((m = regex.exec(str)) !== null) {
    // This is necessary to avoid infinite loops with zero-width matches
    if (m.index === regex.lastIndex) {
        regex.lastIndex++;
    }

    // The result can be accessed through the `m`-variable.
    m.forEach((match, groupIndex) => {
        console.log(`Found match, group ${groupIndex}: ${match}`);
    });
}

If sold-out elements should be excluded entirely, we can skip them using an expression similar to:

summary-title-link">(((?!SOLD OUT)[\s\S])*?)<\/

Demo

JavaScript Test

const regex = /summary-title-link">(((?!SOLD OUT)[\s\S])*?)<\//gm;
const str = `summary-title-link">Brisbane Philharmonic Orchestra - (Some other data)</a>
summary-title-link">Brisbane Philharmonic Orchestra - SOLD OUT</a>
summary-title-link">Brisbane Philharmonic Orchestra - SOLD OUT</a>
summary-title-link">Brisbane Philharmonic Orchestra - (Some other data)</a>
summary-title-link">Brisbane Philharmonic Orchestra - SOLD OUT</a>
summary-title-link">Brisbane Philharmonic Orchestra - (Some other data)</a>
summary-title-link">Brisbane Philharmonic Orchestra - (Some other data)</a>`;
let m;

while ((m = regex.exec(str)) !== null) {
    // This is necessary to avoid infinite loops with zero-width matches
    if (m.index === regex.lastIndex) {
        regex.lastIndex++;
    }

    // The result can be accessed through the `m`-variable.
    // Only the entries without SOLD OUT are reported.
    m.forEach((match, groupIndex) => {
        console.log(`Found match, group ${groupIndex}: ${match}`);
    });
}

Simply state that SOLD OUT must not appear in the matched string:

summary-title-link">(((?!SOLD OUT).)+)<

This pattern matches one or more characters, none of which starts the text SOLD OUT, followed by a < .

Demo
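As a rough sketch of how that pattern behaves, the snippet below runs it over a few lines of sample markup. The markup is made up for illustration (modelled on the titles in the question), and the sketch assumes one link per line, since `.` does not cross line breaks and the `+` is greedy.

```javascript
// Hypothetical sample markup; the real page may look different.
const html = [
  '<a class="summary-title-link">An Evening with Sun Kil Moon</a>',
  '<a class="summary-title-link">Teeny Tiny Stevies - SOLD OUT</a>',
  '<a class="summary-title-link">Cine Retro </a>',
].join('\n');

// Each character is consumed only if "SOLD OUT" does not start at it.
const available = /summary-title-link">(((?!SOLD OUT).)+)</g;

// Group 1 holds the title; trim() drops trailing spaces such as in "Cine Retro ".
const titles = Array.from(html.matchAll(available), (m) => m[1].trim());

console.log(titles);
```

The sold-out line produces no match at all, because the engine can never reach the `<` without stepping over the start of SOLD OUT.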

Reason

The problem here is that, as a greedy quantifier, [^>]+ will not only match the content we want (e.g. "Teeny Tiny Stevies"), but also the flag, "SOLD OUT", which we use to identify the unwanted item.

Thus, when it is (?!SOLD OUT)'s turn, it is checked at the position just after the captured text, where the next character is < (the start of the closing tag). That is not "SOLD OUT", so the negative lookahead succeeds and the item matches.

Take 'Teeny Tiny Stevies - SOLD OUT' as an example. The process is as follows:

  1. [^>]+ : Match as many [^>] characters as possible, backtracking until a < can follow, so it ends up holding the whole title, 'Teeny Tiny Stevies - SOLD OUT'.
  2. (?!SOLD OUT) : Assert that "SOLD OUT" does not follow the current position. What follows is < , so the assertion holds.
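The two steps above can be reproduced with a small sketch. The sample markup is an assumption made up for illustration; only the pattern is taken from the question.

```javascript
// Hypothetical sample markup modelled on the question's output.
const html = [
  '<a class="summary-title-link">An Evening with Sun Kil Moon</a>',
  '<a class="summary-title-link">Teeny Tiny Stevies - SOLD OUT</a>',
].join('\n');

// The pattern from the question: the lookahead sits after the greedy [^>]+,
// so it only inspects what comes after the already-consumed title.
const original = /summary-title-link">([^>]+(?!SOLD OUT))</g;

const titles = [...html.matchAll(original)].map((m) => m[1]);

console.log(titles);
```

Both titles are returned, including the sold-out one, because by the time the lookahead runs the text SOLD OUT is already behind the current position.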

Solution

Unfortunately, I couldn't come up with a solution that does what we want with only one regular expression. I think this may be a limitation of regular expressions: because they match from left to right, looking ahead may simply not be their strength.

But we can solve the problem with two regexes: one for including, one for excluding.

  1. >([^>]+)< : This regex gets the items, though some of them are not wanted.
  2. If an item matches SOLD OUT$ , discard it.
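A minimal sketch of this two-step approach follows. The sample markup is hypothetical, and the first pattern is anchored on the summary-title-link class from the question (an assumption, so that text between other tags is not picked up).

```javascript
// Hypothetical sample markup modelled on the question's output.
const html = [
  '<a class="summary-title-link">An Evening with Sun Kil Moon</a>',
  '<a class="summary-title-link">Amity Dry- Fortified</a>',
  '<a class="summary-title-link">Teeny Tiny Stevies - SOLD OUT</a>',
  '<a class="summary-title-link">Cine Retro </a>',
].join('\n');

// Step 1: include every title, wanted or not.
const anyTitle = /summary-title-link">([^>]+)</g;
const allTitles = Array.from(html.matchAll(anyTitle), (m) => m[1].trim());

// Step 2: exclude the titles that end with the sold-out flag.
const soldOut = /SOLD OUT$/;
const availableTitles = allTitles.filter((title) => !soldOut.test(title));

console.log(availableTitles);
```

Keeping the two patterns separate makes each one easy to read, at the cost of a second pass over the extracted titles.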

There may be a better solution. Hope this can help you.
