Search and Replace RANGE Pattern

Imagine multiple HTML files were merged with all the leftover formatting, tags, etc. (never mind why). What tools should one use to match the range that runs from the opening lines of each subsequently merged HTML file, i.e. <!doctype html>..., up to the beginning of the <h1> header? That range should be replaced by a horizontal rule.

---END OF PREV MERGED FILE---
---BEGIN SEARCH/REPLACE HERE---
<!doctype html>
        <!--[if !IE]>
        <html class="no-js non-ie" lang="en-US" prefix="og: http://ogp.me/ns#"> <![endif]-->
        <!--[if IE 7 ]>
        <html class="no-js ie7" lang="en-US" prefix="og: http://ogp.me/ns#"> <![endif]-->
        <!--[if IE 8 ]>
        <html class="no-js ie8" lang="en-US" prefix="og: http://ogp.me/ns#"> <![endif]-->
        <!--[if IE 9 ]>
---HEAD,META,ETC---
---END SEARCH/REPLACE HERE---
<h1>TITLE OF NEXT MERGED FILE</h1>

I'm not sure whether sed and awk are the wrong tools for this, but a solution along the lines of those tools is preferred.


Input

<li><strong>email_from = root@localhost</strong>, <strong>email_to = root</strong>, <strong>email_host = localhost</strong> defines respectively when the message is a mail the originator&#8217;s email address, the recipient&#8217;s
 email address and the host to which the mail is sent.<strong><br />
 30658  </strong></li>
 30659  </ul>
 30660  <p>Source: <a title="http://linuxaria.com/howto/enabling-automatic-updates-in-centos-7-and-rhel-7" href="http://linuxaria.com/howto/enabling-automatic-updates-in-centos-7-and-rhel-7">Linuxaria&#8217;s website</a>.</p>
 30661                                                                          </div><!-- end of .post-entry -->

 30662

 30663  <div class="post-edit"></div>
 30664                                                          </div><!-- end of #post-4116 -->
 30665
 30666




 30667          <!doctype html>
 30668          <!--[if !IE]>
 30669          <html class="no-js non-ie" lang="en-US" prefix="og: http://ogp.me/ns#"> <![endif]-->
 30670          <!--[if IE 7 ]>
 30671          <html class="no-js ie7" lang="en-US" prefix="og: http://ogp.me/ns#"> <![endif]-->
 30672          <!--[if IE 8 ]>
 30673          <html class="no-js ie8" lang="en-US" prefix="og: http://ogp.me/ns#"> <![endif]-->
 30674          <!--[if IE 9 ]>
 30675          <html class="no-js ie9" lang="en-US" prefix="og: http://ogp.me/ns#"> <![endif]-->
 30676          <!--[if gt IE 9]><!-->
 30677  <html class="no-js" lang="en-US" prefix="og: http://ogp.me/ns#"> <!--<![endif]-->
 30678          <head>

 30679                  <meta charset="UTF-8"/>
 30680                  <meta name="viewport" content="width=device-width, initial-scale=1.0">

 30681                  <title>something something</title>

 30682                  <link rel="profile" href="http://gmpg.org/xfn/11"/>
 30683                  <link rel="pingback" href="www.example.com"/>

 30684
 30685          <h1 class="entry-title post-title">Something Something</h1>

Expected Output

<li><strong>email_from = root@localhost</strong>, <strong>email_to = root</strong>, <strong>email_host = localhost</strong> defines respectively when the message is a mail the originator&#8217;s email address, the recipient&#8217;s
     email address and the host to which the mail is sent.<strong><br />
     30658  </strong></li>
     30659  </ul>
     30660  <p>Source: <a title="http://linuxaria.com/howto/enabling-automatic-updates-in-centos-7-and-rhel-7" href="http://linuxaria.com/howto/enabling-automatic-updates-in-centos-7-and-rhel-7">Linuxaria&#8217;s website</a>.</p>
     30661                                                                          </div><!-- end of .post-entry -->

     30662

     30663  <div class="post-edit"></div>
     30664                                                          </div><!-- end of #post-4116 -->


    <hr />




     30685          <h1 class="entry-title post-title">Something Something</h1>

This seems to do what you want:

awk '/<!doctype html>/{f=1;print "    <hr />";} /<h1 class=/{f=0;} !f' input >output

How it works 这个怎么运作

  • /<!doctype html>/{f=1;print "    <hr />";}

    When we reach a line that contains <!doctype html>, this sets the flag f to 1 to signal that we should stop printing. Then we print the horizontal rule.

  • /<h1 class=/{f=0;}

    When we reach a line that contains <h1 class=, this sets the flag f to 0 to signal that we can resume printing. Because this rule runs before the final !f test, the <h1> line itself is printed.

  • !f

    This causes the current line to be printed if f is 0.

    In more detail, !f is a condition. When the condition is true, awk performs an action. Since no action was specified, awk performs its default action, which is to print the line. ! is awk's symbol for negation. So, when f is false (0), !f is true and the line is printed.
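
The three rules above can also be written out one per line with comments, which may be easier to read than the one-liner. This is only a sketch: the sample text and the variable names sample and result are made up for illustration and are not from the question's real data.

```shell
# Hypothetical sample input, made up for illustration.
sample='end of previous file
<!doctype html>
<head>
<title>something</title>
<h1 class="entry-title">Next Title</h1>
body text'

# Same logic as the one-liner, one rule per line.
result=$(printf '%s\n' "$sample" | awk '
    /<!doctype html>/ { f = 1; print "    <hr />" }  # start skipping and emit the rule once
    /<h1 class=/      { f = 0 }                        # header reached, resume printing
    !f                                                 # default action: print the line
')

printf '%s\n' "$result"
```

With this sample, the doctype, head, and title lines are dropped, a single <hr /> line takes their place, and the <h1> line and everything after it are kept.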

Keeping the first doctype tag

Suppose that we want to remove all doctype tags except for the first. In that case:

awk '/<!doctype html>/{count++; if (count>1){f=1; print "    <hr />";}} /<h1 class=/{f=0;} !f' input

This works by adding another variable, count, which tracks how many doctype tags we have seen. The flag f is set to 1 only after we have seen more than one doctype tag.

To demonstrate the above, let's use this input file:

$ cat input2
miscellaneous stuff
30667          <!doctype html>
30668          something
30669          <h1 class="entry-title post-title">Something Something</h1>
More stuff
30667          <!doctype html>
30668          something 2
30669          <h1 class="entry-title post-title">Something Something</h1>
Still More stuff
30667          <!doctype html>
30668          something 3
30669          <h1 class="entry-title post-title">Something Something</h1>
Stuff at end

The output produced by the command is:

$ awk '/<!doctype html>/{count++; if (count>1){f=1; print "    <hr />";}} /<h1 class=/{f=0;} !f' input2
miscellaneous stuff
30667          <!doctype html>
30668          something
30669          <h1 class="entry-title post-title">Something Something</h1>
More stuff
    <hr />
30669          <h1 class="entry-title post-title">Something Something</h1>
Still More stuff
    <hr />
30669          <h1 class="entry-title post-title">Something Something</h1>
Stuff at end
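
Note that the commands above match <!doctype html> in lower case and an <h1 with a class attribute, as in the real input. The schematic example in the question shows a bare <h1> instead, so if some of the merged files use a bare <h1> or write the doctype as <!DOCTYPE html>, a more tolerant pattern may be needed. The following is a sketch under that assumption, not part of the original answer; the sample text and the names sample, tolerant, and line are hypothetical.

```shell
# Hypothetical sample with an upper-case doctype and a bare <h1>.
sample='tail of previous file
<!DOCTYPE html>
<head>
<h1>TITLE OF NEXT MERGED FILE</h1>
body'

tolerant=$(printf '%s\n' "$sample" | awk '
    { line = tolower($0) }                                   # compare against a lower-cased copy
    line ~ /<!doctype html>/ { f = 1; print "    <hr />" }   # doctype in any letter case
    line ~ /<h1[ >]/         { f = 0 }                       # <h1> or <h1 class=...>
    !f                                                       # print the original line when not skipping
')

printf '%s\n' "$tolerant"
```

The match is done against a lower-cased copy, while !f still prints the original line, so the letter case of the kept lines is unchanged.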
