
Perl regex for complex multiline search replace

I know there are many questions on this topic, but most are fairly trivial and I'm unable to find a solution for my case.

I have a set of HTML files with many, many "media" items like the following, each of which is a "paragraph", separated by "\n\n". Here is a link to a sample file of the type I'm working on.

  <li class="media">
    <div class="media-left">
      <a href="#">
        <img class="media-object" src="4_17-HE-assoc.png" width="250" alt="...">
      </a>
    </div>
    <div class="media-body">
      <h4 class="media-heading">Figure 4.17</h4>
      Association plot for the hair-color eye-color data. Left: marginal table, collapsed over
      gender; right: full table.
    </div>
  </li>

For each <img ...> tag, I need to find the src="file" value and replace the href="#" on the previous line with href="file" class="fancybox", i.e., so that the item will then look like:

  <li class="media">
    <div class="media-left">
      <a href="4_17-HE-assoc.png" class="fancybox">
        <img class="media-object" src="4_17-HE-assoc.png" width="250" alt="...">
      </a>
    </div>
    <div class="media-body">
      <h4 class="media-heading">Figure 4.17</h4>
      Association plot for the hair-color eye-color data. Left: marginal table, collapsed over
      gender; right: full table.
    </div>
  </li>

I tried the following one-liner, but it has no effect, i.e., it doesn't make the changes.

perl -pi~ -e '$/ = "";s|<a href="#">\n(\s*<img class="media object") src=(".*png")|<a class="fancybox" href="\2">\n\1 src=\2|ms' ch03.html

Can someone help with this? I'd be happy with a simple script that I could use for this and adapt for other similar modifications of a collection of web files.

Edit: I'm aware of the advantages of using Perl modules such as HTML::TreeBuilder to avoid having to parse HTML directly. If someone could give me a start, I could probably take it from there.

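# Rewrite the anchors with an HTML parser (XML::LibXML) rather than a regex.
use strict;
use warnings;

use XML::LibXML qw( );

my $qfn = 'ch03.html';

# Keep the original as a backup (ch03.html~) and write the result to ch03.html.
my $in_qfn = $qfn . "~";
my $out_qfn = $qfn;
rename($qfn, $in_qfn)
   or die("Can't rename \"$qfn\": $!\n");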

my $parser = XML::LibXML->new();
my $doc = $parser->parse_html_file($in_qfn);

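# Each <a href="#"> that has an <img> child with a src attribute gets that src as its href.
for my $a_node ($doc->findnodes('//a[@href="#"]')) {
   my ($src_node) = $a_node->findnodes('img[1]/@src')
      or next;

   $a_node->setAttribute('href', $src_node->value());
   $a_node->setAttribute('class', 'fancybox');
}

# Serialize as HTML; libxml adds a DOCTYPE and <html><body> wrappers.
my $html = $doc->toStringHTML();
open(my $fh, '>', $out_qfn)
   or die("Can't create \"$out_qfn\": $!\n");

print($fh $html);
close($fh)
   or die("Can't write \"$out_qfn\": $!\n");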

Tested:

$ diff -u ch03.html{~,}
--- ch03.html~  2016-01-20 12:41:30.809203040 -0800
+++ ch03.html   2016-01-20 12:41:31.009201042 -0800
@@ -1,7 +1,7 @@
-<div class="contents">
+<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
+<html><body><div class="contents">
 <h1 class="tocpage">Chapter 3: Fitting and Graphing Discrete Distributions</h1>
 <hr class="tocpage">
-
 <div class="row">
   <div class="col-md-6">
     <!-- prelude-inserted  -->
@@ -18,7 +18,7 @@
   <div class="col-md-6">
     <h3>Contents</h3>
     <dl class="chaptoc">
-        <dd>3.1. Introduction to discrete distributions</dd>
+<dd>3.1. Introduction to discrete distributions</dd>
         <dd>3.2. Characteristics of discrete distributions</dd>
         <dd>3.3. Fitting discrete distributions</dd>
         <dd>3.4. Diagnosing discrete distributions: Ord plots</dd>
@@ -27,8 +27,7 @@
         <dd>3.7. Chapter summary</dd>
         <dd>3.8. Lab exercises</dd>
     </dl>
-
-  </div>
+</div>
 </div>

 <!-- more-content -->
@@ -38,11 +37,10 @@
        <h3>Selected figures</h3>
      <a class="btn btn-primary" href="../../Rcode/ch03.R" role="button">view R code</a>
     <ul class="media-list">
-      <li class="media">
+<li class="media">
         <div class="media-left">
-          <a href="#">
-            <img class="media-object" src="saxony-barplot.png" width="250" alt="males in Saxony families">
-          </a>
+          <a href="saxony-barplot.png" class="fancybox">
+            <img class="media-object" src="saxony-barplot.png" width="250" alt="males in Saxony families"></a>
         </div>
         <div class="media-body">
           <h4 class="media-heading">Figure 3.2</h4>
@@ -52,9 +50,8 @@

       <li class="media">
         <div class="media-left">
-          <a href="#">
-            <img class="media-object" src="dbinom2-plot2-1.png" width="250" alt="Binomial distributions">
-          </a>
+          <a href="dbinom2-plot2-1.png" class="fancybox">
+            <img class="media-object" src="dbinom2-plot2-1.png" width="250" alt="Binomial distributions"></a>
         </div>
         <div class="media-body">
           <h4 class="media-heading">Figure 3.9</h4>
@@ -64,9 +61,8 @@

       <li class="media">
         <div class="media-left">
-          <a href="#">
-            <img class="media-object" src="dpois-xyplot2-1.png" width="250" alt="Poisson distributions">
-          </a>
+          <a href="dpois-xyplot2-1.png" class="fancybox">
+            <img class="media-object" src="dpois-xyplot2-1.png" width="250" alt="Poisson distributions"></a>
         </div>
         <div class="media-body">
           <h4 class="media-heading">Figure 3.11</h4>
@@ -76,9 +72,8 @@

       <li class="media">
         <div class="media-left">
-          <a href="#">
-            <img class="media-object" src="Fed0-plots2-1.png" width="250" alt="Hanging rootogram">
-          </a>
+          <a href="Fed0-plots2-1.png" class="fancybox">
+            <img class="media-object" src="Fed0-plots2-1.png" width="250" alt="Hanging rootogram"></a>
         </div>
         <div class="media-body">
           <h4 class="media-heading">Figure 3.15</h4>
@@ -89,9 +84,8 @@

       <li class="media">
         <div class="media-left">
-          <a href="#">
-            <img class="media-object" src="ordplot1-1.png" width="250" alt="Ord plot for the Butterfly data">
-          </a>
+          <a href="ordplot1-1.png" class="fancybox">
+            <img class="media-object" src="ordplot1-1.png" width="250" alt="Ord plot for the Butterfly data"></a>
         </div>
         <div class="media-body">
           <h4 class="media-heading">Figure 3.18</h4>
@@ -100,9 +94,10 @@
         </div>
       </li>

-    </ul> <!-- media-list -->
-  </div> <!-- col-md-12 -->
+    </ul>
+<!-- media-list -->
+</div> <!-- col-md-12 -->
 <!-- footer -->
 </div>  <!-- row -->

-</div>
+</div></body></html>

I couldn't resist writing this one-off, super unstable, sends-me-to-parse-html-with-regex-hell sed command:

sed -i.bak '/<a href="#"/ {
    N
    /\n.*<img class=/ {
        s/^\( *<a href="\).*\(\n.*src="\)\([^"]*\)\(.*\)/\1\3" class="fancybox">\2\3\4/
    }
}' ch03.html

This looks for a line containing href="#", appends the next line to it, and then substitutes the file name and the fancybox class into the a tag.

Diffing the input file against the result:

43c43
<           <a href="#">
---
>           <a href="saxony-barplot.png" class="fancybox">
55c55
<           <a href="#">
---
>           <a href="dbinom2-plot2-1.png" class="fancybox">
67c67
<           <a href="#">
---
>           <a href="dpois-xyplot2-1.png" class="fancybox">
79c79
<           <a href="#">
---
>           <a href="Fed0-plots2-1.png" class="fancybox">
