
How to cut an HTML tag, together with its content, from a very large multiline text file using perl, sed or awk?

I want to transform this text (remove <math>.*?</math> ) with sed, awk or perl:

{|
|-
| colspan="2"|
: <math>
[\underbrace{\color{Red}4,2}_{4 > 2},5,1,7] \rightarrow
[2,\underbrace{\color{OliveGreen}4,5}_{4 < 5},1,7] \rightarrow
[2,4,\underbrace{\color{Red}5,1}_{5 > 1},7] \rightarrow
[2,4,1,\underbrace{\color{OliveGreen}5,7}_{5 < 7}]
</math>
|-
|
: <math>
[\underbrace{\color{OliveGreen}2,4}_{2 < 4},1,5,{\color{Blue}7}] \rightarrow
[2,\underbrace{\color{Red}4,1}_{4 > 1},5,{\color{Blue}7}] \rightarrow
[2,1,\underbrace{\color{OliveGreen}4,5}_{4 < 5},{\color{Blue}7}]
</math>
: <math>
[\underbrace{\color{Red}2,1}_{2 > 1},4,{\color{Blue}5},{\color{Blue}7}] \rightarrow
[1,\underbrace{\color{OliveGreen}2,4}_{2 < 4},{\color{Blue}5},{\color{Blue}7}]
</math>
: <math>
[\underbrace{\color{OliveGreen}1,2}_{1 < 2},{\color{Blue}4},{\color{Blue}5},{\color{Blue}7}]
</math>
|}

Into this text (please forgive me if I removed too much; only <math>.*?</math> should be removed):

{|
|-
| colspan="2"|
: 
|-
|
: 
: 
: 
|}

I read about 20 pages and tested 10 scripts, but without good results. The best I have come up with is:

cat dirt-math.txt | awk '/<math>/{cut=1; print;}/<\/math>/{cut=0}!cut'

However, it does not work correctly, since it leaves the <math> and </math> lines in place. It is not bad, but I do not know awk well enough to improve it further.

This should do it:

perl -0777 -pe 's!<math>.*?</math>!!sg' dirt-math.txt

-p says we're doing a sed-like readline/printline loop, -0777 says each "line" is actually the whole input file, and -e specifies the code to run (on each "line" (file)).


If your text files are too big to fit into memory (?!), you can try this:

perl -pe 's!<math>.*?</math>!!g; if ($cut) { if (s!^.*?</math>!!) { $cut = 0 } else { $_ = "" } } if (!$cut && s!<math>.*!!s) { $cut = 1 }' dirt-math.txt

or (slightly more readable):

perl -pe '
    s!<math>.*?</math>!!g;
    if ($cut) {
        if (s!^.*?</math>!!) { $cut = 0 }
        else { $_ = "" }
    }
    if (!$cut && s!<math>.*!!s) { $cut = 1 }
' dirt-math.txt

This is effectively a little state machine.

$cut records whether we're inside an unclosed <math> tag (and so need to cut out input). If we are, we check whether we can find and remove everything up to </math> . If so, we're done cutting (we found a closing </math> tag); otherwise we overwrite the "current line" with the empty string ( $_ = "" ; this is the actual cutting part).

If, after this, we're not cutting, we try to remove <math>... from the input. (This is a separate if rather than an else so that a line such as ... </math> not math text <math> is handled correctly.) If the removal succeeds, we've just seen an opening <math> tag and need to start cutting.

You can also do this with the .. flip-flop (not range) operator, without holding the whole file in memory, removing <math> from the starting line, for example:

perl -wlne 'unless(((/.*<math>/../<\/math>/)||0) > 1){s/<math>//;print}' your-file

If all the data is as nicely formatted as in your example, then your solution is very close. I modified it only slightly, in AWK:

sub(/<math>.*/, "") {print; cut=1}
/<\/math>/          {cut=0; next}
!cut
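A minimal way to try this out, assuming each opening and closing tag sits on its own line as in the question. The wrapper name strip_math_awk and the sample input are made up for illustration.

```shell
# strip_math_awk is a hypothetical helper name used only for this sketch.
strip_math_awk() {
    awk '
        sub(/<math>.*/, "") { print; cut = 1 }   # opening line: keep the text before the tag
        /<\/math>/          { cut = 0; next }    # closing line: stop cutting, drop the line
        !cut                                     # print every line outside a tag
    '
}

# Made-up sample in the same shape as the wiki table from the question.
printf '|\n: <math>\nx\n</math>\n|}\n' | strip_math_awk
```

This prints three lines: "|", then a colon followed by a space, then "|}". Note that a line holding both <math> and </math> would not work here: the sub() removes the closing tag along with everything else, so cut would stay set until a later </math> line.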

This isn't quite a one-liner, but it does what you're looking for. As always, there are many ways of doing this. Here I am using '|' as the record separator and ':' as the field separator. That allows me to iterate over the fields of a record that contains math and print only the fields that don't contain <math></math> .

BEGIN {RS="|";FS=":";ORS=""}

/math/ {
    for (i=1;i<=NF;i++) {
        if ($i ~ /math/) {print ":\n"}
        else {print $i}
    }
    print "|";next;
}

/^\}/ {
    print "}";
    next;
}

{
    print $0"|"
}

END {print "\n"}


 