sed - extract specific characters from a string

Question

So I have some unclean HTML:

"<table class="content divbackground"><tr><td class='title'>&nbsp;</td><td class='title'>From</td><td class='title'>To</td></tr><tr><td class='entry'>Monday</td><td class='entry'>09:00</td><td class='entry'>18:00</td></tr><tr><td class='entry'>Tuesday</td><td class='entry'>09:00</td><td class='entry'>18:00</td></tr><tr><td class='entry'>Wednesday</td><td class='entry'>09:00</td><td class='entry'>18:00</td></tr><tr><td class='entry'>Thursday</td><td class='entry'>09:00</td><td class='entry'>20:00</td></tr><tr><td class='entry'>Friday</td><td class='entry'>09:00</td><td class='entry'>20:00</td></tr><tr><td class='entry'>Saturday</td><td class='entry'>09:00</td><td class='entry'>18:00</td></tr><tr><td class='entry'>Sunday</td><td class='entry'>11:00</td><td class='entry'>18:00</td></tr></table></td></td>"

It's the opening hours of a pharmacy (the information is published on a public register).

Now I could parse the HTML with a parser, but I find that this is not robust to malformed markup, and I would still have to pull out the code between <table> and </table>.

Is there some nice Unix command (sed?) that searches for all occurrences of:

XX:XX

inside <td></td> tags

where X must be a digit?

Answer 1

Handling HTML with regex is not good practice. However, if your input format is fixed, you can try this grep line:

 grep -oP '<td[^>]*>\K\d\d:\d\d' input

With your example input, it outputs:

 09:00
 18:00
 09:00
 18:00
 09:00
 18:00
 09:00
 20:00
 09:00
 20:00
 09:00
 18:00
 11:00
 18:00
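
The -P option (and with it \K and \d) depends on a grep built with PCRE support, which is typical of GNU grep but not guaranteed elsewhere. As a rough sketch for systems without it, the same idea can be split into two steps with extended regular expressions. The function name extract_times is made up for this example, and unlike the line above it assumes each time fills its whole <td> cell:

```shell
#!/bin/sh
# Hypothetical helper (the name is an assumption, not from the answer):
# reads HTML on stdin and prints every DD:DD value that fills a <td> cell.
extract_times() {
    # Step 1: keep only cells of the form <td ...>DD:DD</td>, one per line.
    # Step 2: strip the surrounding tags so only the time remains.
    grep -oE '<td[^>]*>[0-9]{2}:[0-9]{2}</td>' |
        sed -E 's/<[^>]*>//g'
}

# Example with one row from the question's table:
printf '%s\n' "<tr><td class='entry'>Monday</td><td class='entry'>09:00</td><td class='entry'>18:00</td></tr>" |
    extract_times
```

For that single row this should print 09:00 and 18:00 on separate lines. Like the grep line above, it only holds up while the markup keeps this fixed shape.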

solution1 2 ACCEPTED 2015-04-02 08:35:05
