使用Perl正則表達式從HTML文件打印多行模式

Question

我有一個HTML文件。 這是一個樣本

      <div class="criteria" style="padding-left:0;font-style:italic">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;You searched for: 
        <span title="A*" >Individual: <span><b>A*</b></span></span>
      </div>

    </td>

  </tr>

</table>

<table cellpadding="5" cellspacing="0" border="0" style="border-collapse: collapse; width: 100%">

  <tr class="ListItemColorNew">

    <td style="width:50%">
      <div class="gvListItemStyle">
        <span class="LargeText15">JAMES BOND A&#39;MONEYPENNY </span> (LIC# 1111111)
        <div class="GrayTextShade"><i>Alternate Names: BOND JAMES</i></div>
        <div class="GrayTextShade">
          GREY TIDE LLC (LIC# 2222) 
        </div>
      </div>
    </td>

    <td style="width:50%">
      <div class="gvListItemStyle">
        <span class="LargeText15">FRANK WHITE A&#39;SMALLS </span> (LIC# 1111111)
        <div class="GrayTextShade"><i>Alternate Names: JAMES SMALLS</i></div>
        <div class="GrayTextShade">
          WEST RIVER CORP LLC (LIC# 3333) 
        </div>
      </div>
    </td>


    <td style="width: 25%; vertical-align: top">
      <div class="gvListItemStyle">
        <div><img alt="help"  src=\'/Content/images/BrokerCheck/icon-blueCheck.png\'    style=\'vertical-align:top;padding-right:5px\' />Broker</div>
        </div>
    </td>

    <td style="width:25%;text-align:right;vertical-align:top">
      <div class="gvListItemStyle">
        <a class="btn btn-primary" href="/Individual/Summary/5820616">Details &#187;</a>        </div>
    </td>

  </tr>

我正在嘗試提取<td style="width:50%">和</td> 。 數據存儲在文件testFile.txt 。

這是我使用的Perl代碼

 system("perl -pi.bak -e '/^<td style=\"width:50%\">.+<\\/td>/mg' testFile.txt";

Answer 1

您的以下代碼實際上沒有做任何事情：

system("perl -pi.bak -e '/^<td style=\"width:50%\">.+<\\/td>/mg' testFile.txt");

您在沒有捕獲的無效上下文中匹配m// ，因此執行的語句是沒有意義的。
您的模式將永遠不會匹配您的內容，因為：
一種。 您正在使用any字符. ，但除非使用/s修飾符，否則它將不會與換行符匹配。
灣 您正在使用-p進行文件的逐行處理，但是您的模式需要跨行才能匹配。

下面的示例演示了正則表達式解決方案（不推薦）和使用實際的HTML解析器（在本例中為Mojo::DOM 。 有關8分鍾的有用入門視頻，請查看Mojocast第5集

use strict;
use warnings;

use Mojo::DOM;

my $data = do { local $/; <DATA> };

# Regex Solution:
if ( $data =~ m{<td style="width:50%">(.*?)</td>}s ) {
    print "Regex Solution:\n$1";
} else {
    warn "No pattern match found";
}

# Parser Solution:
my $dom = Mojo::DOM->new($data);

my $yourtd = $dom->at(q{td[style="width:50%"]})->content;

print "\nMojo::DOM:\n", $yourtd;

__DATA__
<html>
<head>
<title>Hello World</title>
</head>
<body>
<table>
    <tr>
        </td>
            <div class="criteria" style="padding-left:0;font-style:italic">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;You searched for: 
            <span title="A*" >Individual: <span><b>A*</b></span></span>
            </div>

        </td>
    </tr>
</table>

<table cellpadding="5" cellspacing="0" border="0" style="border-collapse: collapse; width: 100%">

    <tr class="ListItemColorNew">
        <td style="width:50%">
            <div class="gvListItemStyle">
                <span class="LargeText15">JAMES BOND A&#39;MONEYPENNY </span> (LIC# 1111111)
                <div class="GrayTextShade"><i>Alternate Names: BOND JAMES</i></div>

                <div class="GrayTextShade">
                GREY TIDE LLC (LIC# 2222) 
                </div>
            </div>
        </td>
        <td style="width: 25%; vertical-align: top">
            <div class="gvListItemStyle">
            <div><img alt="help"  src=\'/Content/images/BrokerCheck/icon-blueCheck.png\'    style=\'vertical-align:top;padding-right:5px\' />Broker</div>
            </div>
        </td>
        <td style="width:25%;text-align:right;vertical-align:top">
            <div class="gvListItemStyle">
            <a class="btn btn-primary" href="/Individual/Summary/5820616">Details &#187;</a>        </div>
            </td>
    </tr>
<table>
</body>
</html>

輸出：

Regex Solution:

            <div class="gvListItemStyle">
                <span class="LargeText15">JAMES BOND A&#39;MONEYPENNY </span> (LIC# 1111111)
                <div class="GrayTextShade"><i>Alternate Names: BOND JAMES</i></div>

                <div class="GrayTextShade">
                GREY TIDE LLC (LIC# 2222) 
                </div>
            </div>

Mojo::DOM:

            <div class="gvListItemStyle">
                <span class="LargeText15">JAMES BOND A&#39;MONEYPENNY </span> (LIC# 1111111)
                <div class="GrayTextShade"><i>Alternate Names: BOND JAMES</i></div>

                <div class="GrayTextShade">
                GREY TIDE LLC (LIC# 2222) 
                </div>
            </div>

Answer 2

  .*?(<td style="width:50%">((?!<\/td>).)*?<\/td>)

請參閱演示。使用gs標志。

參見演示。

http://regex101.com/r/oC3nN4/15

Answer 3

如評論中所述，刪除正則表達式中的^。

另外，如果要將文件內容視為允許“。”的單行字符串，請使用/ s代替/ mg。 模式以允許匹配換行符'\\ n'。

/<td style=\"width:50%\">.+?<\\/td>/s

。+？ 而在第一次出現</td>而不是最后一次出現時停止匹配

Answer 4

希望您之前已經看到一些避免正則表達式處理HTML的建議？ 真的是真的！ 我能避免使用幾個出色的HTML模塊之一的唯一借口是，您的數據格式錯誤，以至於沒有其他東西可以處理它。

您的HTML文件“樣本”特別無助。 在我固定凹痕之前，線條分散在整個地方。 看完它后，我看到它是一個table元素的結尾，然后是另一個table元素的開始，因此它使幾個元素不平衡，並且要么關閉但未打開，反之亦然。 請不要對我們這樣做。

我構建了一個格式正確的HTML文件，其中包含您的摘錄，這是一個使用HTML::TreeBuilder對其進行處理的程序

use strict;
use warnings;

use HTML::TreeBuilder;

my $tree = HTML::TreeBuilder->new_from_file('html.html');
my @td50 = $tree->look_down(_tag => 'td', style => 'width:50%');
print $_->as_HTML('<>&', '  '), "\n\n" for @td50;

產量

<td style="width:50%">
  <div class="gvListItemStyle"><span class="LargeText15">JAMES BOND A'MONEYPENNY </span> (LIC# 1111111) <div class="GrayTextShade"><i>Alternate Names: BOND JAMES</i></div>
    <div class="GrayTextShade"> GREY TIDE LLC (LIC# 2222) </div>
  </div>
</td>

如果您或其他人需要它，這是我使用的HTML輸入文檔

<html>
  <body>

    <table>
      <tr>
        <td>
          <div class="criteria" style="padding-left:0;font-style:italic">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;You searched for: 
            <span title="A*" >Individual: <span><b>A*</b></span></span>
          </div>
        </td>
      </tr>
    </table>

    <table cellpadding="5" cellspacing="0" border="0" style="border-collapse: collapse; width: 100%">
      <tr class="ListItemColorNew">

        <td style="width:50%">
          <div class="gvListItemStyle">
            <span class="LargeText15">JAMES BOND A&#39;MONEYPENNY </span> (LIC# 1111111)
            <div class="GrayTextShade"><i>Alternate Names: BOND JAMES</i></div>
            <div class="GrayTextShade">
              GREY TIDE LLC (LIC# 2222) 
            </div>
          </div>
        </td>

        <td style="width: 25%; vertical-align: top">
          <div class="gvListItemStyle">
            <div><img alt="help"  src=\'/Content/images/BrokerCheck/icon-blueCheck.png\'    style=\'vertical-align:top;padding-right:5px\' />Broker</div>
            </div>
        </td>

        <td style="width:25%;text-align:right;vertical-align:top">
          <div class="gvListItemStyle">
            <a class="btn btn-primary" href="/Individual/Summary/5820616">Details &#187;</a>        </div>
        </td>

      </tr>
    </table>
  </body>
</html>

使用Perl正則表達式從HTML文件打印多行模式

問題描述

4 個解決方案

解決方案1
1 已采納 2014-09-07 16:52:13

解決方案2
0 2014-09-07 15:18:47

解決方案3
0 2014-09-07 15:25:48

解決方案4
0 2014-09-07 21:44:04

使用Perl正則表達式從HTML文件打印多行模式

問題描述

4 個解決方案

解決方案1 1 已采納 2014-09-07 16:52:13

解決方案2 0 2014-09-07 15:18:47

解決方案3 0 2014-09-07 15:25:48

解決方案4 0 2014-09-07 21:44:04

解決方案1
1 已采納 2014-09-07 16:52:13

解決方案2
0 2014-09-07 15:18:47

解決方案3
0 2014-09-07 15:25:48

解決方案4
0 2014-09-07 21:44:04