Looking for a way to trim HTML code using terminal commands

Question

I'm trying to learn awk and sed better, so that I can write cross-compatible terminal tools without needing things like PHP, Perl and so on. Right now I'm trying to clean up a very long string, which is basically part of an HTML document that I've fetched with curl. I'm wondering about the best way to go about this.

Most solutions that I have found are counting on luxuries like static files or structures, but as I'm trying to clean up fetched HTML code I want to be able to assume that the "periphery" of the string can change a lot, both in size and structure. So what I think I need to be able to do is essentially identify HTML tags, as these likely will not change, and extract the data from those HTML tags, no matter where they are. An example could be something like this:

<span class="unique-class">Payload</span>

I need to be able to look for that entire opening tag and, when it is found, extract everything after the > until a < is found and another tag starts.

My original code is basically useless, because it just greps lines matching certain words (words that can also show up in uninteresting places on the same page), so I'm really open to anything.

Answer 1

You'll very likely need regular expressions to find the string segments you need. Both sed and awk take them: awk uses the extended syntax by default, while sed needs the -E switch for it. I recommend looking for the tags as wholes; otherwise you might end up getting the text between a closing tag and the next opening tag (</span>stuff here<p>), which you probably don't want.

So your regexes, at their most basic, might look something like this (not tested, you will probably have to tweak them). Leave the < and > unescaped, because GNU sed and awk treat \< and \> as word-boundary anchors:

/<[a-zA-Z][a-zA-Z0-9]*[^>]*>/   # Find an opening tag, with or without attributes.
/<\/[a-zA-Z][a-zA-Z0-9]*>/      # Find a closing tag; note the "/" right after the "<".
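As a rough sketch of how the opening-tag pattern could be put to work on the example from the question, the awk snippet below prints whatever follows each <span class="unique-class"> up to the next "<". The function name extract_payloads and the sample input are made up for illustration, and it assumes the tag and its text sit on the same line:

```shell
# Hypothetical helper: print the text after every
# <span class="unique-class"> up to the next "<", one result per line.
extract_payloads() {
  awk '{
    rest = $0
    # match() sets RSTART/RLENGTH to the opening tag plus the text after it.
    while (match(rest, /<span class="unique-class">[^<]*/)) {
      hit = substr(rest, RSTART, RLENGTH)
      sub(/^<span[^>]*>/, "", hit)          # drop the tag, keep the text
      print hit
      rest = substr(rest, RSTART + RLENGTH) # continue after this match
    }
  }'
}

# Made-up sample input; in practice you would pipe curl output in here.
printf '%s\n' '<p>x</p><span class="unique-class">Payload</span><b>y</b>' | extract_payloads
# prints: Payload
```

Because the loop keeps searching the remainder of the line, it also copes with a page that arrives as one very long line containing several matching tags.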

Depending on your needs, you can create a list of tags to look for, specifically, giving you something like:

tags="div|p|article|section"   # Your list of tags, pipe-delimited for OR logic
/<($tags)([[:space:]][^>]*)?>/   # The regex (extended syntax), looking for something like <div[anything]>
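One way the tag list might be used from the shell is sketched below. It relies on grep's -o option, which is an extension found in GNU and BSD grep rather than something POSIX guarantees, and the function name list_opening_tags and the sample input are only for illustration:

```shell
# Pipe-delimited list of tag names to look for.
tags='div|p|article|section'

# Hypothetical helper: print every opening tag from the list, one per line.
# The tag name must be followed by a space (attributes) or by ">".
list_opening_tags() {
  grep -oE "<($tags)( [^>]*)?>"
}

# Made-up sample input.
printf '%s\n' '<div class="a"><p>hi</p><pre>x</pre><article>' | list_opening_tags
# prints:
# <div class="a">
# <p>
# <article>
```

The <pre> tag is skipped even though it starts with "p", because the pattern insists on a space or ">" directly after the tag name.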

You may be able to take it further by matching the opening tag, storing the tag name in a variable (or a capture group), and then finding the matching closing tag. This takes a little more work to get right, but it is more robust and naturally avoids stopping at the wrong closing tag (i.e. stopping at an </a> when it should stop at </p>).
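A minimal sketch of that idea, using a sed back-reference in place of a variable: the tag name is captured as \1 and the closing tag has to repeat it. The function name tag_and_text is hypothetical, and the sketch assumes lowercase tag names, an element whose content is plain text with no nested tags, and one element of interest per line:

```shell
# Hypothetical helper: print "name: text" for an element whose closing tag
# matches its opening tag.
#   \1 = tag name, \2 = optional attributes, \3 = text content
tag_and_text() {
  sed -n 's/.*<\([a-z][a-z0-9]*\)\( [^>]*\)\{0,1\}>\([^<]*\)<\/\1>.*/\1: \3/p'
}

# Made-up sample input.
printf '%s\n' '<p class="x">Hello</p>' | tag_and_text
# prints: p: Hello
```

A mismatched pair such as <article>text</a> produces no output, because the closing tag has to carry the same name that was captured from the opening tag.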

A couple of notes: this may get a little hairy with the single-character tags. If the pattern doesn't check what follows the tag name, your program may confuse things like <a> and <article>, so make sure your code accounts for that.
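To illustrate the <a> versus <article> problem, here is a small awk sketch that counts only real <a> tags by requiring a space or ">" right after the name. The function name count_anchors and the sample input are invented for the example:

```shell
# Hypothetical helper: count opening <a> tags without also counting
# <article>, <abbr> and other tags that merely start with "a".
count_anchors() {
  # gsub() returns the number of replacements; "&" puts each match back unchanged.
  awk '{ n += gsub(/<a( [^>]*)?>/, "&") } END { print n + 0 }'
}

# Made-up sample input.
printf '%s\n' '<a href="x">one</a><article><a>two</a></article><abbr>' | count_anchors
# prints: 2
```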

Also, don't forget that <input>s are used for most of the different form inputs, so if you care about which kind each one is, look at the type attribute whenever you run across an <input>.
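Pulling the type out of each <input> could look roughly like this. It again leans on grep -o (a GNU/BSD extension), assumes double-quoted attribute values, and uses a made-up function name and sample input:

```shell
# Hypothetical helper: print the value of the type attribute of each <input>.
input_types() {
  # First isolate every <input ...> tag on its own line, then pick out type="...".
  grep -oE '<input( [^>]*)?>' | sed -n 's/.* type="\([^"]*\)".*/\1/p'
}

# Made-up sample input.
printf '%s\n' '<form><input type="text" name="q"><input name="s" type="submit"></form>' | input_types
# prints:
# text
# submit
```

An <input> without a type attribute prints nothing here, so you would have to handle that case yourself if it matters to you.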

Finally, you can't necessarily assume that a tag will have a closing tag. Some tags never have one (<br/> / <br>, <hr/> / <hr>), and the HTML specs don't always require one (<li> and <p> don't need closing tags as long as the next opening tag is another <li> or <p>, or the parent's closing tag comes next). You also can't assume that the HTML you get will be valid. So make sure to account for these situations, so your application doesn't crash and burn.

solution1
1 ACCEPTED 2013-03-20 13:53:23