Using a Perl regex to print multi-line patterns from an HTML file

I have an HTML file. Here is a sample:

      <div class="criteria" style="padding-left:0;font-style:italic">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;You searched for: 
        <span title="A*" >Individual: <span><b>A*</b></span></span>
      </div>

    </td>

  </tr>

</table>

<table cellpadding="5" cellspacing="0" border="0" style="border-collapse: collapse; width: 100%">

  <tr class="ListItemColorNew">

    <td style="width:50%">
      <div class="gvListItemStyle">
        <span class="LargeText15">JAMES BOND A&#39;MONEYPENNY </span> (LIC# 1111111)
        <div class="GrayTextShade"><i>Alternate Names: BOND JAMES</i></div>
        <div class="GrayTextShade">
          GREY TIDE LLC (LIC# 2222) 
        </div>
      </div>
    </td>

    <td style="width:50%">
      <div class="gvListItemStyle">
        <span class="LargeText15">FRANK WHITE A&#39;SMALLS </span> (LIC# 1111111)
        <div class="GrayTextShade"><i>Alternate Names: JAMES SMALLS</i></div>
        <div class="GrayTextShade">
          WEST RIVER CORP LLC (LIC# 3333) 
        </div>
      </div>
    </td>


    <td style="width: 25%; vertical-align: top">
      <div class="gvListItemStyle">
        <div><img alt="help"  src=\'/Content/images/BrokerCheck/icon-blueCheck.png\'    style=\'vertical-align:top;padding-right:5px\' />Broker</div>
        </div>
    </td>

    <td style="width:25%;text-align:right;vertical-align:top">
      <div class="gvListItemStyle">
        <a class="btn btn-primary" href="/Individual/Summary/5820616">Details &#187;</a>        </div>
    </td>

  </tr>

I'm trying to extract everything between <td style="width:50%"> and </td>. The data is stored in a file, testFile.txt.

This is the Perl code I used:

 system("perl -pi.bak -e '/^<td style=\"width:50%\">.+<\\/td>/mg' testFile.txt");

The code below isn't actually doing anything:

system("perl -pi.bak -e '/^<td style=\"width:50%\">.+<\\/td>/mg' testFile.txt");
  1. You're matching m// in void context with no captures, so the executed statement does nothing useful.

  2. Your pattern will never match your content because:

    a. You're using the any-character metacharacter . , but it won't match newlines unless you use the /s modifier.

    b. You're using -p for line-by-line processing of the file, but your pattern would need to span lines in order to match.
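
Both reasons in point 2 can be checked in isolation. The following is a minimal sketch using only core Perl; the three-line sample string is a made-up stand-in for testFile.txt, and the variable names are illustrative.

```perl
use strict;
use warnings;

# Hypothetical three-line sample standing in for testFile.txt.
my $html = <<'HTML';
<td style="width:50%">
  <div>GREY TIDE LLC</div>
</td>
HTML

# Line by line, as -p would feed the file: no single line holds both tags.
my $line_hits = grep { m{<td style="width:50%">.+</td>} } split /^/, $html;

# Whole string, but without /s the dot refuses to cross the first newline.
my $no_s = $html =~ m{<td style="width:50%">.+</td>} ? 1 : 0;

# Whole string with /s: the dot now matches newlines as well.
my $with_s = $html =~ m{<td style="width:50%">.+</td>}s ? 1 : 0;

print "line by line: $line_hits, no /s: $no_s, with /s: $with_s\n";
```

Only the last match succeeds, which is why the pattern needs both the whole file in one string and the /s modifier.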

The following demonstrates both a regex solution (not recommended) and using an actual HTML parser, in this case Mojo::DOM. For a helpful 8-minute introductory video, check out Mojocast Episode 5.

use strict;
use warnings;

use Mojo::DOM;

my $data = do { local $/; <DATA> };

# Regex Solution:
if ( $data =~ m{<td style="width:50%">(.*?)</td>}s ) {
    print "Regex Solution:\n$1";
} else {
    warn "No pattern match found";
}

# Parser Solution:
my $dom = Mojo::DOM->new($data);

my $yourtd = $dom->at(q{td[style="width:50%"]})->content;

print "\nMojo::DOM:\n", $yourtd;

__DATA__
<html>
<head>
<title>Hello World</title>
</head>
<body>
<table>
    <tr>
        <td>
            <div class="criteria" style="padding-left:0;font-style:italic">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;You searched for: 
            <span title="A*" >Individual: <span><b>A*</b></span></span>
            </div>

        </td>
    </tr>
</table>

<table cellpadding="5" cellspacing="0" border="0" style="border-collapse: collapse; width: 100%">

    <tr class="ListItemColorNew">
        <td style="width:50%">
            <div class="gvListItemStyle">
                <span class="LargeText15">JAMES BOND A&#39;MONEYPENNY </span> (LIC# 1111111)
                <div class="GrayTextShade"><i>Alternate Names: BOND JAMES</i></div>

                <div class="GrayTextShade">
                GREY TIDE LLC (LIC# 2222) 
                </div>
            </div>
        </td>
        <td style="width: 25%; vertical-align: top">
            <div class="gvListItemStyle">
            <div><img alt="help"  src=\'/Content/images/BrokerCheck/icon-blueCheck.png\'    style=\'vertical-align:top;padding-right:5px\' />Broker</div>
            </div>
        </td>
        <td style="width:25%;text-align:right;vertical-align:top">
            <div class="gvListItemStyle">
            <a class="btn btn-primary" href="/Individual/Summary/5820616">Details &#187;</a>        </div>
            </td>
    </tr>
</table>
</body>
</html>

Outputs:

Regex Solution:

            <div class="gvListItemStyle">
                <span class="LargeText15">JAMES BOND A&#39;MONEYPENNY </span> (LIC# 1111111)
                <div class="GrayTextShade"><i>Alternate Names: BOND JAMES</i></div>

                <div class="GrayTextShade">
                GREY TIDE LLC (LIC# 2222) 
                </div>
            </div>

Mojo::DOM:

            <div class="gvListItemStyle">
                <span class="LargeText15">JAMES BOND A&#39;MONEYPENNY </span> (LIC# 1111111)
                <div class="GrayTextShade"><i>Alternate Names: BOND JAMES</i></div>

                <div class="GrayTextShade">
                GREY TIDE LLC (LIC# 2222) 
                </div>
            </div>
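
The sample in the question contains two td cells with style="width:50%", while at returns only the first match. As a minimal sketch, assuming Mojo::DOM is installed, every matching cell can be collected with find instead. The shortened HTML below is a made-up stand-in for the real table.

```perl
use strict;
use warnings;

use Mojo::DOM;

# Shortened, hypothetical stand-in for the table in the question.
my $html = <<'HTML';
<table>
  <tr>
    <td style="width:50%"><div>GREY TIDE LLC (LIC# 2222)</div></td>
    <td style="width:50%"><div>WEST RIVER CORP LLC (LIC# 3333)</div></td>
    <td style="width:25%"><div>Broker</div></td>
  </tr>
</table>
HTML

my $dom = Mojo::DOM->new($html);

# find returns a collection of every matching element; at returns only the first.
my @cells = $dom->find(q{td[style="width:50%"]})->map('content')->each;

print "$_\n" for @cells;
```

Note that at returns undef when nothing matches, so calling content on its result directly will die on input without such a cell; find simply yields an empty collection in that case.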

.*?(<td style="width:50%">((?!<\/td>).)*?<\/td>)

Use the g and s flags. See the demo:

http://regex101.com/r/oC3nN4/15
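
For readers who want to try this pattern in Perl rather than on regex101, here is a minimal sketch. It is an adaptation, not the pattern verbatim: the leading .*? is dropped and the inner group is made non-capturing so that a list-context match with /g returns one string per cell. The sample data is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical, shortened stand-in for testFile.txt.
my $html = <<'HTML';
<tr>
<td style="width:50%">
  GREY TIDE LLC
</td>
<td style="width:50%">
  WEST RIVER CORP LLC
</td>
</tr>
HTML

# (?:(?!</td>).)* consumes one character at a time, but only while the
# text ahead is not a closing </td>, so a match cannot run past its own cell.
my @cells = $html =~ m{(<td style="width:50%">(?:(?!</td>).)*</td>)}gs;

print scalar(@cells), " cells found\n";
print "$_\n---\n" for @cells;
```

The negative lookahead does the job that the non-greedy quantifier does in simpler patterns, at the cost of readability.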

As said in the comments, remove the ^ from your regexp.

Also, use /s instead of /mg if you want to treat the file content as a single-line string; /s allows the '.' pattern to match newline characters ('\n').

/<td style=\"width:50%\">.+?<\\/td>/s

.+? will stop the match at the first occurrence of </td>, not the last.
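
The difference between .+ and .+? shows up as soon as the input holds more than one cell. A minimal sketch with core Perl and made-up sample data follows. It matches against a string that already holds the whole input, which is an assumption: the original -p one-liner still reads one line at a time.

```perl
use strict;
use warnings;

# Hypothetical, shortened stand-in for testFile.txt with two matching cells.
my $html = <<'HTML';
<td style="width:50%">
  GREY TIDE LLC (LIC# 2222)
</td>
<td style="width:50%">
  WEST RIVER CORP LLC (LIC# 3333)
</td>
HTML

# Greedy: .+ runs to the end of the string, then backs up to the LAST </td>.
my ($greedy) = $html =~ m{<td style="width:50%">(.+)</td>}s;

# Non-greedy: .+? stops at the FIRST </td> it can reach.
my ($lazy) = $html =~ m{<td style="width:50%">(.+?)</td>}s;

print "greedy capture spans both cells\n" if $greedy =~ /WEST RIVER/;
print "lazy capture:$lazy";
```

On the command line, one way to get the whole file into a single string is Perl's slurp switch -0777, for example perl -0777 -ne 'print "$1\n" while m{<td style="width:50%">(.+?)</td>}gs' testFile.txt.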

I hope you've seen previous advice to avoid using regexes to process HTML? It really is true! The only excuse I can think of for avoiding one of the several excellent HTML modules is that your data is so malformed that nothing else will process it.

Your "sample" of your HTML file is particularly unhelpful. Before I fixed the indentation, the lines were scattered all over the place. After I looked at it, I saw that it was the end of one table element followed by the start of another, so several elements were left unbalanced: either closed but never opened, or vice versa. Please don't do that to us.

I built a well-formed HTML file that contains your extract, and this is a program that processes it using HTML::TreeBuilder:

use strict;
use warnings;

use HTML::TreeBuilder;

my $tree = HTML::TreeBuilder->new_from_file('html.html');
my @td50 = $tree->look_down(_tag => 'td', style => 'width:50%');
print $_->as_HTML('<>&', '  '), "\n\n" for @td50;

Output:

<td style="width:50%">
  <div class="gvListItemStyle"><span class="LargeText15">JAMES BOND A'MONEYPENNY </span> (LIC# 1111111) <div class="GrayTextShade"><i>Alternate Names: BOND JAMES</i></div>
    <div class="GrayTextShade"> GREY TIDE LLC (LIC# 2222) </div>
  </div>
</td>

In case you or others need it, here's the HTML input document that I used:

<html>
  <body>

    <table>
      <tr>
        <td>
          <div class="criteria" style="padding-left:0;font-style:italic">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;You searched for: 
            <span title="A*" >Individual: <span><b>A*</b></span></span>
          </div>
        </td>
      </tr>
    </table>

    <table cellpadding="5" cellspacing="0" border="0" style="border-collapse: collapse; width: 100%">
      <tr class="ListItemColorNew">

        <td style="width:50%">
          <div class="gvListItemStyle">
            <span class="LargeText15">JAMES BOND A&#39;MONEYPENNY </span> (LIC# 1111111)
            <div class="GrayTextShade"><i>Alternate Names: BOND JAMES</i></div>
            <div class="GrayTextShade">
              GREY TIDE LLC (LIC# 2222) 
            </div>
          </div>
        </td>

        <td style="width: 25%; vertical-align: top">
          <div class="gvListItemStyle">
            <div><img alt="help"  src=\'/Content/images/BrokerCheck/icon-blueCheck.png\'    style=\'vertical-align:top;padding-right:5px\' />Broker</div>
            </div>
        </td>

        <td style="width:25%;text-align:right;vertical-align:top">
          <div class="gvListItemStyle">
            <a class="btn btn-primary" href="/Individual/Summary/5820616">Details &#187;</a>        </div>
        </td>

      </tr>
    </table>
  </body>
</html>
