
Regular expression: ignore HTML tags

I have HTML content like this:

<p>The bedding was hardly <strong>able to cover</strong> it and seemed ready to slide off any moment.</p>

Here's a complete version of the HTML. http://collabedit.com/gkuc2

I need to search for the string hardly able to cover (just an example), ignoring any HTML tags inside the string I'm looking for. In the HTML file there are tags inside that string, so a simple search won't find it.

The use case is: I have two versions of a file:

  • An HTML file with text and tags
  • The same file but with the raw text only (all tags and extra spaces removed)

The substring that I want to search for (the needle) comes from the text version (which doesn't contain any HTML tags), and I want to find its position in the HTML version (the file that has tags).

What regular expression would work?

Put this between each letter:

(?:<[^>]+>)*

and replace the spaces with:

(?:\s*<[^>]+>\s*)*\s+(?:\s*<[^>]+>\s*)*

Like:

h(?:<[^>]+>)*a(?:<[^>]+>)*r(?:<[^>]+>)*d(?:<[^>]+>)*l(?:<[^>]+>)*y(?:\s*<[^>]+>\s*)*\s+(?:\s*<[^>]+>\s*)*a(?:<[^>]+>)*b(?:<[^>]+>)*l(?:<[^>]+>)*e(?:\s*<[^>]+>\s*)*\s+(?:\s*<[^>]+>\s*)*t(?:<[^>]+>)*o(?:\s*<[^>]+>\s*)*\s+(?:\s*<[^>]+>\s*)*c(?:<[^>]+>)*o(?:<[^>]+>)*v(?:<[^>]+>)*e(?:<[^>]+>)*r

You only need the ones between each letter if you want to allow tags to break up words, like: This is b<b>old</b>
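Writing that pattern by hand is tedious, so it can be generated from the needle. The following is a minimal Python sketch, assuming the two fragments above are used unchanged; the names TAGS, SPACE, and build_letter_pattern are made up for illustration.

```python
import re

# Fragment placed between letters: zero or more tags.
TAGS = r"(?:<[^>]+>)*"
# Fragment that replaces each space: whitespace, optionally mixed with tags.
SPACE = r"(?:\s*<[^>]+>\s*)*\s+(?:\s*<[^>]+>\s*)*"

def build_letter_pattern(needle):
    """Build a regex that tolerates tags between every letter of the needle."""
    words = needle.split()
    return SPACE.join(
        TAGS.join(re.escape(ch) for ch in word)  # escape regex metacharacters
        for word in words
    )

html = "<p>This is b<b>old</b> text</p>"
match = re.search(build_letter_pattern("is bold text"), html)
print(match.span())   # position of the needle inside the HTML
print(match.group())  # the matched HTML, tags included
```

Here the match covers is b<b>old</b> text, so match.start() gives the position in the HTML version.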

Here is the pattern without the per-letter breaks:

hardly(?:\s*<[^>]+>\s*)*\s+(?:\s*<[^>]+>\s*)*able(?:\s*<[^>]+>\s*)*\s+(?:\s*<[^>]+>\s*)*to(?:\s*<[^>]+>\s*)*\s+(?:\s*<[^>]+>\s*)*cover
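The word-level pattern can be generated the same way. This is a small Python sketch run against the sample paragraph from the question; build_word_pattern and SPACE are hypothetical names, and the fragment is the one given above for spaces.

```python
import re

# Fragment that replaces each space: whitespace, optionally mixed with tags.
SPACE = r"(?:\s*<[^>]+>\s*)*\s+(?:\s*<[^>]+>\s*)*"

def build_word_pattern(needle):
    """Build a regex that tolerates tags between the words of the needle."""
    return SPACE.join(re.escape(word) for word in needle.split())

html = ("<p>The bedding was hardly <strong>able to cover</strong> it "
        "and seemed ready to slide off any moment.</p>")

match = re.search(build_word_pattern("hardly able to cover"), html)
print(match.start(), match.end())  # offsets into the HTML version
print(match.group())               # the matched HTML, tags included
```

Note that the fragment requires at least one whitespace character between words (the \s+ in the middle), so two words separated only by a tag, with no space, would not match.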

This should work for most cases. However, if the HTML is malformed, with a literal < or > that is not HTML-encoded, you may run into issues. It may also break on script blocks or other elements with CDATA sections.

Try saving the text in a variable, then remove all the tags and perform a normal search on the result. In PHP you can use the built-in function strip_tags().
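Stripping the tags loses the position in the HTML, which is what the question asks for. One way around that is to remember where each kept character came from. The sketch below is in Python rather than PHP and does not use strip_tags(); it assumes the simple tag pattern <[^>]+> is good enough, and strip_tags_with_map is a hypothetical helper, not a library function.

```python
import re

TAG = re.compile(r"<[^>]+>")  # simple tag pattern, not a real HTML parser

def strip_tags_with_map(html):
    """Return (text, index_map); index_map[i] is the offset in html of text[i]."""
    kept = []
    pos = 0
    for tag in TAG.finditer(html):
        kept.extend(range(pos, tag.start()))  # keep the characters before the tag
        pos = tag.end()                       # skip the tag itself
    kept.extend(range(pos, len(html)))        # keep whatever follows the last tag
    return "".join(html[i] for i in kept), kept

html = "<p>The bedding was hardly <strong>able to cover</strong> it.</p>"
needle = "hardly able to cover"

text, index_map = strip_tags_with_map(html)
found = text.find(needle)
if found != -1:
    start = index_map[found]                         # offset of the first character
    end = index_map[found + len(needle) - 1] + 1     # one past the last character
    print(start, end, html[start:end])
```

This does a plain search on the stripped text, so it only works when the whitespace in the needle matches the whitespace left after stripping; collapsing extra spaces would need the map to be adjusted as well.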

EDIT: You could instead look for the first and last words (or just the first, and then work with the rest of the result) to locate candidate strings, then strip the tags from each candidate and check whether it's the one you're looking for. For example, use a regex such as hardly.*cover or even hardly.*$ and save the location of each result. Then run strip_tags() on each result and check whether it is the one you want. I know it's a somewhat odd solution, but it lets you avoid an endless regex.
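The candidate-and-verify idea could look roughly like this in Python. It is only a sketch: it uses plain string search for the first and last words instead of a regex, strips tags with the simple pattern <[^>]+> instead of strip_tags(), and the names normalize and locate are invented for the example.

```python
import re

TAG = re.compile(r"<[^>]+>")  # simple tag pattern, not a real HTML parser

def normalize(fragment):
    """Remove tags and collapse runs of whitespace to single spaces."""
    return " ".join(TAG.sub("", fragment).split())

def locate(html, needle):
    """Return (start, end) offsets of the needle in html, or None if not found."""
    words = needle.split()
    target = " ".join(words)
    first, last = words[0], words[-1]
    start = html.find(first)
    while start != -1:
        end = html.find(last, start)
        while end != -1:
            candidate = html[start:end + len(last)]
            if normalize(candidate) == target:   # verify after stripping tags
                return start, end + len(last)
            end = html.find(last, end + 1)       # try a later last word
        start = html.find(first, start + 1)      # try a later first word
    return None

html = ("<p>The bedding was hardly <strong>able to cover</strong> it "
        "and seemed ready to slide off any moment.</p>")
result = locate(html, "hardly able to cover")
print(result, html[result[0]:result[1]])
```

Because it checks every pairing of a first word with a later last word, this can get slow on large files with common words; limiting how far ahead it looks would be a reasonable refinement.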
