I am doing some light text processing to find video content in HTML files uploaded by users. We have defined a tag called "video", and users are supposed to reference video files like this:
<video> abcd.mp4 </video>
Currently I am using awk to extract the lines that contain the video tag:
str=$(awk '/<video>/{flag=1;} /<\/video>/{print ;flag=0} flag { print }' file.html)
The output contains the tags too, so I strip the prefix and suffix to get the video file name. It's done like this:
prefix="<video>"
suffix="</video>"
foo=${str#$prefix}
foo=${foo%$suffix}
But this only works for files that use the video tag just once. For files with multiple video tags, the string returned by awk runs from the first occurrence of <video> to the last occurrence of </video>.
So my question is: how should I write a script that ends up with an array of all the strings between the <video> and </video> tags? Also, how can I change
<video> abcd.mp4 </video>
to, say,
<media> abcd.mp4 </media>?
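To get each tag by itself (POSIX extended regular expressions have no lazy quantifier, so ".+?" matches greedily under GNU grep -E; "[^<]+" stops at the next tag instead and assumes the file name contains no "<"):
grep -Eo "<video>[^<]+</video>" myfile.html
To get just the text within the tags, with surrounding whitespace trimmed:
grep -Eo "<video>[^<]+</video>" myfile.html | sed -E 's|</?video>||g; s|^[[:space:]]+||; s|[[:space:]]+$||'
If the opening and closing tags are on different lines:
tr "\n" " " < myfile.html | grep -Eo "<video>[^<]+</video>" | sed -E 's|</?video>||g; s|^[[:space:]]+||; s|[[:space:]]+$||'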
Example input:
This is a <video> video1.mp4 </video> file with <other> <random> </tags>
<media> media1.mp4 </media>
<video> video2.mp4 </video>
<media> media 2 with spaces
and over
multiple lines.mp4 </media>
Example output:
video1.mp4
video2.mp4
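The question asked for an array, which the pipeline above does not build by itself. Here is a minimal sketch of one way to do it, assuming bash 4 or later (for mapfile) and file names that contain no "<". The file name sample.html and the array name videos are illustrative, not part of the original setup.

```shell
#!/usr/bin/env bash
# Sketch: collect every <video> file name into a bash array.
# Assumes bash 4+ (mapfile); sample.html and "videos" are illustrative names.
tmpdir=$(mktemp -d)
cat > "$tmpdir/sample.html" <<'EOF'
This is a <video> video1.mp4 </video> file with <other> <random> </tags>
<media> media1.mp4 </media>
<video> video2.mp4 </video>
EOF

# Join lines, pull out each tag pair, then strip the tags and outer spaces.
mapfile -t videos < <(
    tr '\n' ' ' < "$tmpdir/sample.html" |
    grep -Eo '<video>[^<]+</video>' |
    sed -E 's|</?video>||g; s|^[[:space:]]+||; s|[[:space:]]+$||'
)

printf 'found %d video(s)\n' "${#videos[@]}"
for v in "${videos[@]}"; do
    printf '%s\n' "$v"
done
rm -rf "$tmpdir"
```

Each array element is one file name, so names with spaces stay intact as long as they are expanded with quotes, as in "${videos[@]}".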
To get both video and media tags (please specify in your original question):
tr "\n" " " < vid.html | grep -Eo "<(video|media)>[^<]+</(video|media)>" | sed -E 's#</?(video|media)>##g; s#^[[:space:]]+##; s#[[:space:]]+$##'
Output:
video1.mp4
media1.mp4
video2.mp4
media 2 with spaces and over multiple lines.mp4
For your second question, run the whole file through this command:
sed -E "s|(</?)video>|\1media>|g" vid.html
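Note that this sed command only prints the converted text to standard output and leaves vid.html unchanged. Here is a small sketch of saving the result, assuming a scratch directory. The file names vid.html and vid_media.html are illustrative.

```shell
# Sketch: rename <video> tags to <media> and save the result to a new file.
# vid.html and vid_media.html are illustrative names created in a temp dir.
workdir=$(mktemp -d)
printf '%s\n' '<video> abcd.mp4 </video>' 'plain text' > "$workdir/vid.html"

# The capture group keeps "<" or "</", so opening and closing tags are both rewritten.
sed -E 's|(</?)video>|\1media>|g' "$workdir/vid.html" > "$workdir/vid_media.html"

result=$(cat "$workdir/vid_media.html")
printf '%s\n' "$result"
rm -rf "$workdir"
```

Writing to a second file and moving it into place afterwards avoids relying on sed -i, whose argument syntax differs between GNU and BSD sed.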
Try this:
$ cat tst.awk
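BEGIN {
    # Build the literal opening and closing tags from -v tag=...
    stag = "<" tag ">"
    etag = "</" tag ">"
}
# Opening tag found: keep only the text after it and start a new record.
(pos = index($0, stag)) {
    $0 = substr($0, pos + length(stag))
    rec = ""
    inTag = 1
}
inTag {
    if (pos = index($0, etag)) {
        # Closing tag found: append the text before it, trim, and print.
        rec = rec substr($0, 1, pos - 1)
        gsub(/^[[:space:]]+|[[:space:]]+$/, "", rec)
        print "<" rec ">"
        inTag = 0
    }
    else {
        # Still inside the tag: keep the whole line, newline included.
        rec = rec $0 ORS
    }
}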
$
$ cat file
<video> video1.mp4 </video>
<media> media1.mp4 </media>
<video>
video2.mp4 </video>
<media> media 2 with
spaces and
over multiple
lines.mp4
</media>
$
$ awk -v tag="video" -f tst.awk file
<video1.mp4>
<video2.mp4>
$
$ awk -v tag="media" -f tst.awk file
<media1.mp4>
<media 2 with
spaces and
over multiple
lines.mp4>
Change print "<" rec ">" to just print rec after you understand and are happy with what it's doing.
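One limitation worth knowing: tst.awk stops scanning a line once it has found a closing tag, so a second tag pair on the same line is skipped. The following is a rough sketch of one way around that, not a drop-in replacement. It loops over the remainder of each line, and the input file name multi.html and the variable names are illustrative.

```shell
# Sketch: like tst.awk, but keeps scanning a line after each closing tag.
# multi.html is an illustrative input file created in a temp dir.
workdir=$(mktemp -d)
cat > "$workdir/multi.html" <<'EOF'
<video> a.mp4 </video> text <video> b.mp4 </video>
<video>
c.mp4 </video>
EOF

names=$(awk -v tag="video" '
BEGIN { stag = "<" tag ">"; etag = "</" tag ">" }
{
    line = $0
    while (1) {
        if (!inTag) {
            pos = index(line, stag)
            if (!pos) break                      # no further opening tag on this line
            line = substr(line, pos + length(stag))
            rec = ""
            inTag = 1
        }
        pos = index(line, etag)
        if (!pos) { rec = rec line ORS; break }  # tag continues on the next line
        rec = rec substr(line, 1, pos - 1)
        gsub(/^[ \t\n]+|[ \t\n]+$/, "", rec)     # trim outer whitespace
        print rec
        line = substr(line, pos + length(etag))  # keep scanning the rest of the line
        inTag = 0
    }
}' "$workdir/multi.html")

printf '%s\n' "$names"
rm -rf "$workdir"
```

The trim uses a plain bracket list instead of [[:space:]] only so that older awk builds without POSIX character classes can run it.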