Bash script: parse files for multiple occurrences of a string between patterns

Question

I am doing a little text processing to find video content in HTML files uploaded by users. We have defined a tag called "video", and users are supposed to reference video files like this:

<video> abcd.mp4 </video>

Presently I am using awk to extract the lines that contain the video tag:

str=$(awk '/<video>/{flag=1;} /<\/video>/{print ;flag=0} flag { print }' file.html)

The output contains the tags too, so I remove the prefix and suffix to get the video file name. It's done like this:

prefix="<video>"
suffix="</video>"              
foo=${str#$prefix}
foo=${foo%$suffix}

But this only works for files that use the video tag just once. For files that use the tag several times, the string returned by awk runs from the first occurrence of <video> to the last occurrence of </video>.

So my question is: how should I write a script that ends up with an array of all the strings between the <video> and </video> tags? Also, how can I change

<video> abcd.mp4 </video>

to say

<media> abcd.mp4 </media>.

Answer 1

To get each tag by itself (grep -E has no non-greedy .+?, so [^<]* is used to stop each match at the next tag):

grep -Eo "<video>[^<]*</video>" myfile.html

To get just the text within the tags:

grep -Eo "<video>[^<]*</video>" myfile.html | sed -E "s|</?video>||g"

If the opening and closing tags are on different lines:

tr "\n" " " < myfile.html | grep -Eo "<video>[^<]*</video>" | sed -E "s|</?video>||g"

Example input:

This is a <video> video1.mp4 </video>  file with <other> <random> </tags>
<media> media1.mp4 </media> 
<video> video2.mp4 </video> 
<media>     media 2 with spaces 
and over 
multiple lines.mp4 </media>

Example output:

video1.mp4 
video2.mp4

To get both video and media tags (please specify in your original question):

tr "\n" " " < vid.html | grep -Eo "<(video|media)>[^<]*</(video|media)>" | sed -E "s#</?(video|media)>##g"

Output:

 video1.mp4 
 media1.mp4 
 video2.mp4 
 media 2 with spaces      and over      multiple lines.mp4
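
The pipelines above print one match per line, while the question asks for an array. A minimal sketch of that last step follows, assuming bash (arrays and process substitution) and file names that never contain a "<". The helper name extract_videos and the sample file are made up for illustration.

```shell
#!/usr/bin/env bash
# Sketch: collect every <video>...</video> payload into a bash array.
# Assumption: file names never contain '<'.

extract_videos() {
    # $1: path to the HTML file (hypothetical helper name)
    tr '\n' ' ' < "$1" |
        grep -Eo '<video>[^<]*</video>' |
        sed -E 's|</?video>||g; s/^[[:space:]]+//; s/[[:space:]]+$//'
}

# Demo on a throwaway file.
tmp=$(mktemp)
printf '%s\n' \
    'This is a <video> video1.mp4 </video> file, and <video>video2.mp4</video>' \
    '<media> media1.mp4 </media>' \
    '<video>' \
    'video3.mp4 </video>' > "$tmp"

# Read one match per line into the array (works on bash 3.2 and later).
videos=()
while IFS= read -r line; do
    videos+=("$line")
done < <(extract_videos "$tmp")
rm -f "$tmp"

printf 'found %d video(s)\n' "${#videos[@]}"
for v in "${videos[@]}"; do
    printf '[%s]\n' "$v"
done
```

On bash 4 or newer, the while loop could be replaced by mapfile -t videos < <(extract_videos "$tmp").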

For your second question, run the whole file through this command:

sed -E "s|(</?)video>|\1media>|g" vid.html
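
The command above only prints the converted text. If the file itself should be rewritten, something like the following sketch might do. The helper name rename_tag and the sample file are hypothetical, the tag names are assumed to be plain words with no regex metacharacters, and -i.bak (which keeps a backup copy) is assumed to be supported by the installed sed, as it is in GNU and BSD sed.

```shell
#!/usr/bin/env bash
# Sketch: rename <old>/</old> tags to <new>/</new> in place, keeping a .bak backup.

rename_tag() {
    # $1: old tag name, $2: new tag name, $3: file (hypothetical helper)
    sed -E -i.bak "s|(</?)$1>|\1$2>|g" "$3"
}

# Demo on a throwaway file.
tmp=$(mktemp)
printf '%s\n' \
    'a <video> abcd.mp4 </video> b' \
    '<video>' \
    'x.mp4 </video>' > "$tmp"

rename_tag video media "$tmp"
result=$(cat "$tmp")
rm -f "$tmp" "$tmp.bak"

printf '%s\n' "$result"
```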

Answer 2

Try this:

$ cat tst.awk
BEGIN{
    stag = "<"  tag ">"
    etag = "</" tag ">"
}

pos = index($0,stag) {
    $0 = substr($0,pos+length(stag)) 
    rec = ""
    inTag = 1
}

inTag {
    if (pos = index($0,etag)) {
        rec = rec substr($0,1,pos-1) 
        gsub(/^[[:space:]]+|[[:space:]]+$/,"",rec)
        print "<" rec ">"
        inTag = 0
    }
    else {
        rec = rec $0 ORS
    }
}
$ 
$ cat file
<video> video1.mp4 </video>
<media> media1.mp4 </media>
<video>
video2.mp4 </video>
<media> media 2 with
spaces and
over multiple
lines.mp4
</media>
$ 
$ awk -v tag="video" -f tst.awk file
<video1.mp4>
<video2.mp4>
$   
$ awk -v tag="media" -f tst.awk file
<media1.mp4>
<media 2 with
spaces and
over multiple
lines.mp4>

Change print "<" rec ">" to just print rec after you understand and are happy with what it's doing.
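
The script above looks for one opening tag per line. If several tags can share a line, a loop over the remainder of the line is one way to handle it. This is only a sketch along the same lines as tst.awk, not a replacement for it: the function name extract_tag and the sample file are invented here, and multi-line values are joined with a space (instead of ORS) so that each value ends up on a single output line.

```shell
#!/usr/bin/env bash
# Sketch: print the text between <tag> and </tag>, one value per output line,
# handling several tags on one line and values that span lines.

extract_tag() {
    # $1: tag name, $2: file (hypothetical helper)
    awk -v tag="$1" '
    BEGIN { stag = "<" tag ">"; etag = "</" tag ">" }
    {
        line = $0
        while (line != "") {
            if (!inTag) {
                pos = index(line, stag)
                if (!pos) break                        # no more opening tags on this line
                line = substr(line, pos + length(stag))
                inTag = 1; rec = ""
            }
            pos = index(line, etag)
            if (!pos) { rec = rec line " "; break }    # value continues on the next line
            rec = rec substr(line, 1, pos - 1)
            gsub(/^[[:space:]]+|[[:space:]]+$/, "", rec)
            print rec
            inTag = 0
            line = substr(line, pos + length(etag))    # keep scanning the rest of the line
        }
    }' "$2"
}

# Demo on a throwaway file.
tmp=$(mktemp)
printf '%s\n' \
    '<video> video1.mp4 </video> and <video>video2.mp4</video>' \
    '<media> media 2 with' \
    'spaces and' \
    'lines.mp4' \
    '</media>' > "$tmp"

videos=$(extract_tag video "$tmp")
media=$(extract_tag media "$tmp")
rm -f "$tmp"

printf '%s\n' "$videos" "$media"
```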

2 answers

solution1
1 ACCEPTED 2013-10-07 15:21:51

solution2
1 2013-10-07 15:58:11
