How do I modify HTML files in Perl?

Question

I have a bunch of HTML files, and in each one I want to find the keyword 'From Argumbay' and replace it with an href that I have. I thought it would be very simple at first: I opened each HTML file, loaded its content into an array (list), looked for each keyword and replaced it with s///, and dumped the contents back to the file. The problem is that sometimes the keyword also appears inside an href, in which case I don't want it replaced, or it appears inside some tags and such.

An example: http://www.astrosociety.org/education/surf.html

I would like my script to replace each occurrence of the word 'here' with some href that I have in $href, but as you can see, there is another 'here' which is already linked, and I don't want it linked again. In this case there aren't any additional 'here's apart from the one in the href, but let's assume that there are.

I want to replace the keyword only if it's just text. Any ideas?

BOUNTY EDIT: Hi, I believe it's a simple thing, but it seems to erase all the comments found in the HTML/SHTML file (the main issue is that it erases SSIs in SHTML files). I tried calling the store_comments(1) method on $html before calling the recursive function, but to no avail. Any idea what I'm missing here?

Answer 1

To do this with HTML::TreeBuilder, you would read the file, modify the tree, and write it out (to the same file, or a different file). This is fairly complex, because you're trying to convert part of a text node into a tag, and because you have comments that can't move.

A common idiom with HTML-Tree is to use a recursive function that modifies the tree:

use strict;
use warnings;
use 5.008;

use File::Slurp 'read_file';
use HTML::TreeBuilder;

sub replace_keyword
{
  my $elt = shift;

  return if $elt->is_empty;

  $elt->normalize_content;      # Make sure text is contiguous

  my $content = $elt->content_array_ref;

  for (my $i = 0; $i < @$content; ++$i) {
    if (ref $content->[$i]) {
      # It's a child element, process it recursively:
      replace_keyword($content->[$i])
          unless $content->[$i]->tag eq 'a'; # Don't descend into <a>
    } else {
      # It's text:
      if ($content->[$i] =~ /here/) { # your keyword or regexp here
        $elt->splice_content(
          $i, 1, # Replace this text element with...
          substr($content->[$i], 0, $-[0]), # the pre-match text
          # A hyperlink with the keyword itself:
          [ a => { href => 'http://example.com' },
            substr($content->[$i], $-[0], $+[0] - $-[0]) ],
          substr($content->[$i], $+[0])   # the post-match text
        );
      } # end if text contains keyword
    } # end else text
  } # end for $i in content index
} # end replace_keyword


my $content = read_file('foo.shtml');

# Wrap the SHTML fragment so the comments don't move:
my $html = HTML::TreeBuilder->new;
$html->store_comments(1);
$html->parse("<html><body>$content</body></html>");
$html->eof; # Signal end of input so the tree is finalized

my $body = $html->look_down(qw(_tag body));
replace_keyword($body);

# Now strip the wrapper to get the SHTML fragment back:
$content = $body->as_HTML;
$content =~ s!^<body>\n?!!;
$content =~ s!</body>\s*\z!!;

print STDOUT $content; # Replace STDOUT with a suitable filehandle

The output from as_HTML will be syntactically correct HTML, but not necessarily nicely formatted HTML for people reading the source. You can use HTML::PrettyPrinter to write out the file if you want that.
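As a lighter option than a separate pretty-printer, as_HTML itself takes optional arguments that control entity encoding, indentation, and which end tags may be omitted. The following is only a sketch under assumptions: the sample markup and the footer.html include are made up for illustration, and the exact whitespace of the result depends on the installed HTML::Element version.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Made-up sample fragment with a link and an SSI comment.
my $tree = HTML::TreeBuilder->new;
$tree->store_comments(1);
$tree->parse('<html><body><p>Click <a href="http://example.com">here</a>'
           . '<!--#include virtual="footer.html" --></body></html>');
$tree->eof;

my $body = $tree->look_down(_tag => 'body');

# Arguments: characters to entity-encode, the indent string, and a hash of
# tags whose end tags may be omitted (an empty hash means "always emit them").
my $pretty = $body->as_HTML('<>&', '  ', {});
print $pretty;

$tree->delete;    # release the tree when done
```

Passing the empty hash reference is what makes as_HTML write closing tags such as </p> that it would otherwise leave out.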

Answer 2

If tags matter in your search and replace, you'll need to use HTML::Parser.

This tutorial looks a bit easier to understand than the documentation that comes with the module.
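For a rough idea of what that could look like, here is a minimal sketch using the HTML::Parser version 3 handler API. It copies the input through unchanged and only rewrites text that is not inside an <a> element, a script or style block, a tag, or a comment. The helper name link_keyword and the sample input are hypothetical, and $href is assumed to be safe to place in an attribute as-is.

```perl
use strict;
use warnings;
use HTML::Parser ();

# Hypothetical helper: wrap every plain-text occurrence of $keyword in a link.
sub link_keyword {
    my ($html, $keyword, $href) = @_;
    my $out     = '';
    my $in_link = 0;    # number of currently open <a> elements

    my $parser = HTML::Parser->new(
        api_version => 3,
        start_h => [ sub {
            my ($tag, $text) = @_;
            $in_link++ if $tag eq 'a';
            $out .= $text;                      # copy the tag verbatim
        }, 'tagname, text' ],
        end_h => [ sub {
            my ($tag, $text) = @_;
            $in_link-- if $tag eq 'a' && $in_link;
            $out .= $text;
        }, 'tagname, text' ],
        text_h => [ sub {
            my ($text, $is_cdata) = @_;
            # Skip text inside <a> and inside script/style content.
            $text =~ s{(\Q$keyword\E)}{<a href="$href">$1</a>}g
                unless $in_link || $is_cdata;
            $out .= $text;
        }, 'text, is_cdata' ],
        # Comments (including SSI directives), declarations, etc. pass through.
        default_h => [ sub { $out .= shift }, 'text' ],
    );
    $parser->unbroken_text(1);    # deliver each text run in one piece
    $parser->parse($html);
    $parser->eof;
    return $out;
}

my $sample = q{<p>Click here or <a href="here.html">here</a>.}
           . q{<!--#include virtual="here.html" --></p>};
print link_keyword($sample, 'here', 'http://example.com'), "\n";
```

Because every event's original source text is appended to the output, anything the handlers do not touch is written back exactly as it was read.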

Answer 3

If you want to go with a regular-expression-only approach and you're prepared to accept the following provisos:

this will not work correctly within HTML comments
this will not work where the < or > character is used within a tag
this will not work where the < or > character is used and is not part of a tag
this will not work where a tag spans multiple lines (if you're processing one line at a time)

If any of the above conditions do exist, then you will have to use one of the HTML/XML parsing strategies outlined in the other answers.

Otherwise:

my $searchfor = "From Argumbay";
my $replacewith = "<a href='http://google.com/?s=Argumbay'>From_Argumbay</a>";

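# $html is assumed to already hold the file's contents as a single string
1 while $html =~ s/
  \A               # beginning of string
  (                # capture everything before the search text
    (?:            # sub group: non-tag text followed by a tag
      [^<]*?       # non-tag text (non-greedy)
      <[^>]*>      # one whole tag
    )*?            # zero or more (non-greedy)
    [^<]*?         # text between the last tag and the search text
  )
  \Q$searchfor\E   # search text
/$1$replacewith/sx;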

Note that this will NOT work if $searchfor matches $replacewith (so don't put "From Argumbay" back into the replacement text); otherwise the `1 while` loop would keep matching its own output and never finish.
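The restriction above comes from the loop rescanning text it has already replaced. A single-pass variant avoids it. The following is only a sketch of a different technique (splitting the string on tags with split), not the method from this answer, and the helper name replace_outside_tags is hypothetical. The provisos listed above still apply, and like the substitution above it still replaces text inside an existing <a>...</a> element.

```perl
use strict;
use warnings;

# Hypothetical helper: replace $searchfor only in the text between tags.
sub replace_outside_tags {
    my ($html, $searchfor, $replacewith) = @_;

    # The capturing group makes split() return the tags as well as the text,
    # so even-numbered pieces are text and odd-numbered pieces are tags.
    my @pieces = split /(<[^>]*>)/, $html;

    for my $i (0 .. $#pieces) {
        next if $i % 2;    # leave tags untouched
        $pieces[$i] =~ s/\Q$searchfor\E/$replacewith/g;
    }
    return join '', @pieces;
}

my $html = q{<p title="From Argumbay">Sent From Argumbay</p>};
print replace_outside_tags(
    $html,
    'From Argumbay',
    q{<a href='http://google.com/?s=Argumbay'>From Argumbay</a>},
), "\n";
```

Since each text piece is visited exactly once, the replacement may contain the search text without causing an endless loop.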

3 answers

Answer 1: 7 (accepted), 2010-10-11 00:17:45
Answer 2: 3, 2010-10-10 15:50:13
Answer 3: 0, 2010-10-11 08:08:41
