
Regex to match attributes in HTML?

I have a txt file which is actually the HTML source of a webpage. Inside that file there are various strings preceded by a title= attribute, e.g.

<div id='UWTDivDomains_5_6_2_2'  title='Connectivity Framework'> 

I want the text Connectivity Framework to be extracted and written to a separate file.

There are many such tags, each with different text after title='some text here which I need to extract'. I want to extract every such instance of the text from the HTML source/txt file and write it to a separate txt file. The text can contain lower-case letters, upper-case letters and numbers only. The length of each text string (in characters) will vary.

I am using PowerGrep for Windows. PowerGrep allows me to search a text file with a regular expression as input. I tried searching with title='[a-zA-Z0-9]

It shows the correct matches, but it matches only the first character of each string, so only that first character is written to the second txt file rather than the whole string.

I want the whole string to be matched and written to the second file.

What is the correct regular expression, or the correct way, to do what I want using PowerGrep?

-AD.

I'm just not sure how many times the question of regular expression parsing of HTML files has to be asked (and answered with the correct solution of "use a DOM parser"). It comes up every day.

The difficulties are:

  • In HTML attributes can have single-quotes, double-quotes or even no quotes;
  • Similar strings can appear in the HTML document itself;
  • You have to handle correct escaping; and
  • Malformed HTML (decent parsers are extremely robust to common errors).

So if you cater for all this (and it gets to be a pretty complicated yet still imperfect regex), it's still not 100%.

HTML parsers exist for a reason. Use them.
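As a rough illustration of the parser route, here is a minimal sketch using Python's standard-library html.parser. The class name TitleCollector, the helper extract_titles and the sample markup are made up for this example. It is not a PowerGrep recipe, just one way to collect title values without a regex.

```python
from html.parser import HTMLParser


class TitleCollector(HTMLParser):
    """Collects the value of every title attribute seen in start tags."""

    def __init__(self):
        super().__init__()
        self.titles = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; the parser has already
        # removed any surrounding single or double quotes.
        for name, value in attrs:
            if name == "title" and value is not None:
                self.titles.append(value)


def extract_titles(html):
    """Hypothetical helper: return all title attribute values in html."""
    collector = TitleCollector()
    collector.feed(html)
    collector.close()
    return collector.titles


# Made-up sample covering single-quoted, double-quoted and unquoted values.
sample = """<div id='UWTDivDomains_5_6_2_2'  title='Connectivity Framework'>
<span title="Double Quoted">x</span>
<p title=Bare>y</p>"""
titles = extract_titles(sample)
print(titles)
```

The same code handles all three quoting styles, which is the first difficulty in the list above.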

The other answers all give correct changes to the regex, so I'll explain what the issue was with your original.

The square brackets indicate a character class - meaning that the regex will match any one character within those brackets. However, like everything else, it will only match once by default. Just as the regex "s" would match only the first character in "ssss", the regex "[a-zA-Z0-9]" will match only the first character in "Connectivity Framework".

By adding repetition, one can get that character class to match repeatedly. The easiest way to do this is by adding an asterisk after it (which will match 0 or more occurrences). Thus the regex "[a-zA-Z0-9]*" will match as many characters in a row as it can until it hits a character that is not in that character class (in your case, the space character, since you didn't include that in your brackets).

Describing a syntax accurately with a regex can get complicated, though: what if someone puts a non-alphanumeric character such as an ampersand inside the attribute? You could capture everything between the quotes by making the character class "anything except a quote character", so "'[^']*'" would usually do the right thing. In many languages you also need to bear escaping in mind (e.g. in the string 'Mary\'s lamb' you do want to capture the apostrophe in the middle, so a simple "everything but apostrophes" character class won't cut it), though thankfully this is not an issue with XML/HTML according to the specs.
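The effect of the asterisk can be checked quickly. This sketch uses Python's re module rather than PowerGrep (an assumption, since the two engines may differ in details), and the variable names are only for illustration.

```python
import re

text = "title='Connectivity Framework'"

# No repetition: the character class consumes exactly one character.
single = re.search(r"title='[a-zA-Z0-9]", text).group(0)

# With *: the class repeats, but stops at the space, which is not in the class.
starred = re.search(r"title='[a-zA-Z0-9]*", text).group(0)

# Adding a space to the class lets the match run on to the closing quote.
with_space = re.search(r"title='[a-zA-Z0-9 ]*", text).group(0)

print(single)      # title='C
print(starred)     # title='Connectivity
print(with_space)  # title='Connectivity Framework
```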
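To see the difference between an alphanumeric class and "anything except a quote", here is a small sketch in Python's re module. The sample markup is invented for the example.

```python
import re

# Made-up sample: the first value contains an ampersand.
html = "<a title='Research & Development'> <b title='Phase 2'>"

# "Anything except a quote" captures whatever sits between the quotes.
any_value = re.findall(r"title='([^']*)'", html)

# A letters/digits/space class silently skips the value with the ampersand.
alnum_only = re.findall(r"title='([a-zA-Z0-9 ]*)'", html)

print(any_value)   # ['Research & Development', 'Phase 2']
print(alnum_only)  # ['Phase 2']
```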

Still, if there is an existing library available that will do the extraction for you, this is likely to be faster and more correct than rolling your own, so I would lean towards that if possible.

I'm not familiar with PowerGrep, but your regex is incomplete. Try this:

title='[a-zA-Z0-9 ]*'

or better yet:

title='([^']*)'
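Outside PowerGrep, the capture group in the second pattern is what lets you keep only the text between the quotes. Here is a minimal sketch in Python that reads a source file and writes one captured value per line. The function name write_titles and the file names in the usage comment are hypothetical, and only single-quoted values are handled.

```python
import re
from pathlib import Path

# Group 1 is the text between the single quotes.
TITLE_RE = re.compile(r"title='([^']*)'")


def write_titles(source_path, output_path):
    """Hypothetical helper: copy every single-quoted title value to output_path."""
    html = Path(source_path).read_text(encoding="utf-8")
    titles = TITLE_RE.findall(html)  # with one group, findall returns group 1
    body = "\n".join(titles) + ("\n" if titles else "")
    Path(output_path).write_text(body, encoding="utf-8")
    return titles


# Usage (file names are placeholders):
#   write_titles("page_source.txt", "titles.txt")
```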

I would use this regular expression to get the title attribute values:

<[a-z]+[^>]*\s+title\s*=\s*("[^"]*"|'[^']*'|[^\s >]*)

Note that the captured attribute value includes its surrounding quotes, so you have to remove them if needed.
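Stripping the quotes afterwards might look like the sketch below, which applies the same pattern in Python's re module. The names ATTR_RE and title_values and the sample markup are assumptions made for illustration. Note that the pattern only matches lower-case tag names as written.

```python
import re

# The pattern from above; group 1 is the raw value, quotes included.
ATTR_RE = re.compile(r"""<[a-z]+[^>]*\s+title\s*=\s*("[^"]*"|'[^']*'|[^\s >]*)""")


def title_values(html):
    """Hypothetical helper: return title values with surrounding quotes removed."""
    values = []
    for raw in ATTR_RE.findall(html):
        # Drop one matching pair of quotes; unquoted values pass through as-is.
        if len(raw) >= 2 and raw[0] == raw[-1] and raw[0] in "\"'":
            raw = raw[1:-1]
        values.append(raw)
    return values


# Made-up sample with single-quoted, double-quoted and unquoted values.
sample = (
    "<div id='x'  title='Connectivity Framework'> "
    '<span title="Two Words"> '
    "<p class=a title=Bare>"
)
result = title_values(sample)
print(result)
```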

Here's the regex you need (note the space inside the character class, since your sample value contains one):

title='([a-zA-Z0-9 ]+)'

but if you're going to be doing a lot more stuff like this, using a parser might make it much more robust and useful.

Try the following:

title='[a-zA-Z0-9 ]*'
