I want to transform this text (remove <math>.*?</math>) with sed, awk or perl:
{|
|-
| colspan="2"|
: <math>
[\underbrace{\color{Red}4,2}_{4 > 2},5,1,7] \rightarrow
[2,\underbrace{\color{OliveGreen}4,5}_{4 < 5},1,7] \rightarrow
[2,4,\underbrace{\color{Red}5,1}_{5 > 1},7] \rightarrow
[2,4,1,\underbrace{\color{OliveGreen}5,7}_{5 < 7}]
</math>
|-
|
: <math>
[\underbrace{\color{OliveGreen}2,4}_{2 < 4},1,5,{\color{Blue}7}] \rightarrow
[2,\underbrace{\color{Red}4,1}_{4 > 1},5,{\color{Blue}7}] \rightarrow
[2,1,\underbrace{\color{OliveGreen}4,5}_{4 < 5},{\color{Blue}7}]
</math>
: <math>
[\underbrace{\color{Red}2,1}_{2 > 1},4,{\color{Blue}5},{\color{Blue}7}] \rightarrow
[1,\underbrace{\color{OliveGreen}2,4}_{2 < 4},{\color{Blue}5},{\color{Blue}7}]
</math>
: <math>
[\underbrace{\color{OliveGreen}1,2}_{1 < 2},{\color{Blue}4},{\color{Blue}5},{\color{Blue}7}]
</math>
|}
into text like this (please forgive me if I removed too much by hand; only <math>.*?</math> should be removed):
{|
|-
| colspan="2"|
:
|-
|
:
:
:
|}
I read about 20 pages and tested 10 scripts, but without good results. The best I have come up with is:
cat dirt-math.txt | awk '/<math>/{cut=1; print;}/<\/math>/{cut=0}!cut'
However, it does not work correctly, since it leaves the <math></math> tags behind. It is not bad, but I do not know awk well enough to improve it further.
This should do it:
perl -0777 -pe 's!<math>.*?</math>!!sg' dirt-math.txt
-p says we're doing a sed-like readline/printline loop, -0777 says each "line" is actually the whole input file, and -e specifies the code to run (on each "line" (file)).
If your text files are too big to fit into memory (?!), you can try this:
perl -pe 's!<math>.*?</math>!!g; if ($cut) { if (s!^.*?</math>!!) { $cut = 0 } else { $_ = "" } } if (!$cut && s!<math>.*!!s) { $cut = 1 }' dirt-math.txt
or (slightly more readable):
perl -pe '
    s!<math>.*?</math>!!g;
    if ($cut) {
        if (s!^.*?</math>!!) { $cut = 0 }
        else { $_ = "" }
    }
    if (!$cut && s!<math>.*!!s) { $cut = 1 }
' dirt-math.txt
This is effectively a little state machine. $cut records whether we're in an unclosed <math> tag (and so need to cut out input). If so, we check whether we were able to find/remove </math>. If so, we're done cutting (we found a closing </math> tag); otherwise we overwrite the "current line" with the empty string ($_ = ""; this is the actual cutting part).
If, after this, we're not cutting (we're not using else to handle the case where ... </math> not math text <math> appears on a single line), we try to remove <math>... from the input. If so, we've just seen an opening <math> tag and need to start cutting.
You can also do this with the .. flip-flop (not range) operator, without holding the whole file in memory, removing <math> from the starting line, for example:
perl -wlne 'unless(((/.*<math>/../<\/math>/)||0) > 1){s/<math>//;print}' your-file
If all your data is as nicely formatted as in your example, then your solution is very close. I modified it only slightly, in AWK:
sub(/<math>.*/, "") {print; cut=1}
/<\/math>/ {cut=0; next}
!cut
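As a quick check, the three rules can be run inline on a small sample. The four-line input and the variable name out are made up for this sketch.

```shell
# Made-up sample with one multi-line <math> block followed by a normal line.
out=$(printf ': <math>\nx\n</math>\nkeep\n' |
  awk '
    sub(/<math>.*/, "") {print; cut=1}   # opening line: strip the tag, print the rest
    /<\/math>/ {cut=0; next}             # closing line: stop cutting, print nothing
    !cut                                 # print lines outside a block
  ')
printf '%s\n' "$out"
```

This prints the ": " prefix and then "keep". A line holding a complete <math>...</math> pair would, as far as I can tell, leave cut set, because sub() has already deleted the closing tag before the second rule looks for it; that is the sense in which the script relies on nicely formatted input.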
This isn't quite a one-liner, but it does what you're looking for. As always, there are many ways of doing this. Here I am using '|' as the record separator and ':' as the field separator. That allows me to iterate over the fields of a record that contains math and print only the fields that don't contain <math></math>.
BEGIN {RS="|";FS=":";ORS=""}
/math/ {
    for (i=1;i<=NF;i++) {
        if ($i ~ /math/) {print ":\n"}
        else {print $i}
    }
    print "|";next;
}
/^\}/ {
    print "}";
    next;
}
{
    print $0"|"
}
END {print "\n"}