
Bash script: parse files for multiple occurrences of a string between patterns

I am doing a little text processing to find video content in HTML files uploaded by users. We have defined a tag called "video", and users are supposed to mark up video files like this:

<video> abcd.mp4 </video>

Presently I am using awk to extract the lines that contain the video tag:

str=$(awk '/<video>/{flag=1;} /<\/video>/{print ;flag=0} flag { print }' file.html)

The output contains the tags too, so I remove the prefix and suffix to get the video file name. It's done like this:

prefix="<video>"
suffix="</video>"
foo=${str#$prefix}
foo=${foo%$suffix}

But this only works for files in which the video tag is used just once. For files that use the tag several times, the string returned by awk runs from the first occurrence of <video> to the last occurrence of </video>.

So my question is: how should I write a script that ends up with an array of all the strings between the <video> and </video> tags? Also, how can I change

<video> abcd.mp4 </video>

to say

<media> abcd.mp4 </media>.

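To get each tag by itself (grep -E uses POSIX ERE, which has no lazy .+? quantifier, so match a run of non-< characters instead; this assumes the file name itself contains no <):

grep -Eo "<video>[^<]+</video>" myfile.html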

To get just the text within the tags:

grep -Eo "<video>[^<]+</video>" myfile.html | sed -E "s|</?video>||g"

If the opening and closing tags are on different lines:

tr "\n" " " < myfile.html | grep -Eo "<video>[^<]+</video>" | sed -E "s|</?video>||g"

Example input:

This is a <video> video1.mp4 </video>  file with <other> <random> </tags>
<media> media1.mp4 </media> 
<video> video2.mp4 </video> 
<media>     media 2 with spaces 
and over 
multiple lines.mp4 </media>

Example output:

video1.mp4 
video2.mp4 
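
Since the question asks for an array, here is a minimal sketch of collecting those lines into a bash array. It assumes bash 4 or newer (for mapfile) and file names that contain neither < nor a newline. The sample file and the variable names file and videos are made up for illustration. The pattern [^<]* stands in for a lazy quantifier, which POSIX ERE does not define.

```shell
#!/usr/bin/env bash
# Hypothetical sample input; in practice point "$file" at the uploaded HTML.
file=$(mktemp)
cat > "$file" <<'EOF'
This is a <video> video1.mp4 </video>  file with <other> <random> </tags>
<media> media1.mp4 </media>
<video> video2.mp4 </video> <video>video3.mp4</video>
EOF

# Join all lines, pull out each <video>...</video> pair, strip the tags and
# the surrounding whitespace, then read one result per line into an array.
mapfile -t videos < <(
    tr '\n' ' ' < "$file" |
    grep -Eo '<video>[^<]*</video>' |
    sed -E 's|</?video>||g; s/^[[:space:]]+//; s/[[:space:]]+$//'
)

rm -f "$file"

printf 'found %d video(s)\n' "${#videos[@]}"
for v in "${videos[@]}"; do
    printf '[%s]\n' "$v"
done
```

Each element is then available as "${videos[0]}", "${videos[1]}", and so on. On a bash older than 4, a while read loop that appends to the array would be the usual substitute for mapfile.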

To get both video and media tags (please specify this in your original question):

tr "\n" " " < vid.html | grep -Eo "<(video|media)>[^<]+</(video|media)>" | sed -E "s#</?(video|media)>##g"

Output:

 video1.mp4 
 media1.mp4 
 video2.mp4 
 media 2 with spaces      and over      multiple lines.mp4 

For your second question, run the whole file through this command:

sed -E "s|(</?)video>|\1media>|g" vid.html
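
That command prints the converted text to standard output and leaves the file untouched. As a rough sketch of saving the result, the same substitution can write to a second file. The temporary files src and dst are stand-ins invented for this example. Writing to a separate file also sidesteps the differing -i syntax of GNU and BSD sed.

```shell
#!/usr/bin/env bash
src=$(mktemp)   # hypothetical stand-in for vid.html
dst=$(mktemp)   # converted copy
printf '%s\n' '<video> abcd.mp4 </video>' 'x <video>a.mp4</video> y' > "$src"

# Rewrite both <video> and </video>; \1 re-emits the "<" or "</" that matched.
sed -E 's|(</?)video>|\1media>|g' "$src" > "$dst"

converted=$(cat "$dst")
printf '%s\n' "$converted"
rm -f "$src" "$dst"
```

Once the converted copy looks right, it can be moved over the original with mv.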

Try this:

$ cat tst.awk
BEGIN{
    stag = "<"  tag ">"
    etag = "</" tag ">"
}

pos = index($0,stag) {
    $0 = substr($0,pos+length(stag)) 
    rec = ""
    inTag = 1
}

inTag {
    if (pos = index($0,etag)) {
        rec = rec substr($0,1,pos-1) 
        gsub(/^[[:space:]]+|[[:space:]]+$/,"",rec)
        print "<" rec ">"
        inTag = 0
    }
    else {
        rec = rec $0 ORS
    }
}
$ 
$ cat file
<video> video1.mp4 </video>
<media> media1.mp4 </media>
<video>
video2.mp4 </video>
<media> media 2 with
spaces and
over multiple
lines.mp4
</media>
$ 
$ awk -v tag="video" -f tst.awk file
<video1.mp4>
<video2.mp4>
$   
$ awk -v tag="media" -f tst.awk file
<media1.mp4>
<media 2 with
spaces and
over multiple
lines.mp4>

Change print "<" rec ">" to just print rec after you understand and are happy with what it's doing.
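
If the end goal is a bash array and a value may span several lines, one option is to do the matching in bash itself. This is not the awk script above but a pure-bash alternative, sketched under a few assumptions: the text fits comfortably in a variable, the content between the tags contains no <, and the names tag, text, rest and names are invented for the example. It also picks up several tags on one line.

```shell
#!/usr/bin/env bash
tag=video
# Hypothetical sample; for a real file use: text=$(< file)
text=$'<video> video1.mp4 </video> <video>\nvideo2.mp4 </video>\n<media> m.mp4 </media>'

names=()
# Group 1: content of the first complete tag pair; group 2: everything after it.
re="<$tag>([^<]*)</$tag>(.*)"
rest=$text
while [[ $rest =~ $re ]]; do
    name=${BASH_REMATCH[1]}
    rest=${BASH_REMATCH[2]}
    # Trim leading and trailing whitespace, including newlines.
    name="${name#"${name%%[![:space:]]*}"}"
    name="${name%"${name##*[![:space:]]}"}"
    names+=("$name")
done

printf 'found %d name(s)\n' "${#names[@]}"
printf '[%s]\n' "${names[@]}"
```

Because each match is appended as a single element, a multi-line value stays together in one array slot instead of being split per line.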
