Trying to get perl regex to find multi-line AND single-line HTML comments

I'm trying to find both single-line and multi-line comments in an HTML file. I've stripped it down to just a few examples, plus some other content just to have something there.

I've read a lot of the entries here but can't get a definitive answer to this. I'm reading in the HTML file in "slurp" mode and matching my pattern against it. This code runs now and prints only the first match.

#!C:\Perl\bin\perl.exe 

BEGIN {  unshift @INC, 'C:\rmhperl'; } 

use warnings;
no warnings 'uninitialized';

chdir 'c:\watts\html'; 

open FILE, "test.html" or print 'error opening file "test.html" ';
my $text = do { local $/; <FILE> };
close(FILE);

if ($text =~ m/(?s)(<!--.*?)(-->\n)/sg) {
    print "1 = $1  2= $2\n";
}

exit;

I've set up single and multi-line comments in the HTML file. I can get one or the other printed but not both (at least in "slurp" mode).

I'm told I should be able to accomplish this with a single regex, so the objective is "find all HTML comments, regardless of whether they are single-line or multi-line comments".

I built the regex to find both, but it finds only the first match, which is a multi-line comment.

I'm trying to find a way to find every match, whether it occurs on one line or multiple lines. I can find one or the other, but I can't get both to work with one regex.

I can do non-slurp mode, find the <!-- tag, then loop until I see the --> tag, but I wanted to see if I can get it to work with a single regex.

I've been reading about this and trying to find relevant examples, but I can't see what I'm missing. Here's the HTML file snippet I have been using for the regex:

HTML file

<!DOCTYPE html>
 
<script type="text/javascript" src="fadeslideshow.js"></script>
<style>

.divTable {
    display: block;
    width: 100%;
}

.divTableBody, .divTableRow{ clear: both; }

.divTableCell {
    border: 1px solid #999999;
    float: left;
    overflow: hide;
    padding: 2%;
    width: 45%; }

.divTable:after {
    display: block;
    font-size: 0;
    content: " ";
    clear: both;
    height: 100px; }
</style>
<style type="text/css">
<!--
a:link {color: #0000ff;}
 a:visited {color: #3563a8;} 
 a:active {color: #000000;}
 a:hover {background-color: #000000;}
 a {text-decoration: none;}
 -->
 </style> 
</head>
    <body class="home">

 <div id="white_back">
<div style="text-align: center">
</div>
<div class="chromestyle" id="chromemenu">
<ul>
<!-- <li><a href="xyz.com">Home</a></li>
 -->
 <li><a href="#" rel="dropmenu0">About Us</a></li>
<li><a href="#" rel="dropmenu5">Publications</a></li>   
</ul>
</div>

<!--1st drop down menu
-->                                                   
<div id="dropmenu0" class="dropmenudiv">
</div>

<!--2nd drop down menu -->
<div id="dropmenu1" class="dropmenudiv">
</div>

I presume this is production code, in which case your manager is a scary man, as this sort of practice can result in hard-to-find bugs. That's acceptable if the code is only for yourself, but inflicting that on others is unfair.

Some notes on your code

  • The shebang line #! is unnecessary on Windows systems, and in fact does nothing unless you specify command-line options there. It's best to drop it altogether

  • Always use strict and use warnings 'all', and fix the bugs rather than disabling warnings with no warnings 'uninitialized'

  • BEGIN { unshift @INC, 'C:\\rmhperl' } is best written use lib 'C:\\rmhperl' but you're not using libraries in this case so it will have no effect

  • You should use lexical file handles with the three-parameter form of open

  • There is no need for both (?s) in the regex pattern and the /s modifier. Unless you are doing something fancy like enabling options for only part of the pattern (which you're not) then people will understand you better if you use the modifier /s

The reason you're only finding one comment is that you're only asking for one. In scalar context a global regex pattern match will iterate through all the matches in the target string one at a time. You only call it once, so it finds only the first. You can fix that by using a while in place of if
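As a minimal sketch of that difference, assuming a hypothetical in-memory string in place of the slurped file, the same pattern can be run with if, with while, and in list context:

```perl
use strict;
use warnings;

# Hypothetical sample text standing in for the slurped file.
my $text = "<!-- one -->\n<p>x</p>\n<!-- two\nlines -->\n";

# Scalar context with "if": the match is attempted once,
# so only the first comment is ever seen.
my $count_if = 0;
if ( $text =~ /(<!--.*?-->)/sg ) {
    $count_if++;
}

# The /g match above left a search position behind; rewind it
# so the next search starts from the beginning of the string.
pos($text) = 0;

# Scalar context with "while": each iteration resumes where the
# previous match ended, until no more matches are found.
my $count_while = 0;
while ( $text =~ /(<!--.*?-->)/sg ) {
    $count_while++;
}

# List context: all captured comments are returned at once.
my @comments = $text =~ /(<!--.*?-->)/sg;

print "if: $count_if, while: $count_while, list: ", scalar(@comments), "\n";
```

The list-context form is an alternative to the while loop when all the comments are wanted in an array rather than processed one at a time.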

I've improved your regex pattern somewhat by making sure that the opening <!-- isn't followed by > or by ->, either of which would form an illegal HTML comment. There may also be optional whitespace between the closing -- and the >, so I've allowed for that. And you are insisting on a newline after the end of the comment, which may not be there, so I've removed that

This code seems to work with your data

use strict;
use warnings 'all';

my $text = do {
    open my $fh, '<', 'test.html' or die qq{Unable to open file "test.html" for input: $!};
    local $/;
    <$fh>;
};

while ( $text =~ /(<!--(?!-?>).*?--\s*>)/sg ) {
    my $comment = $1;
    print $comment, "\n";
}

output

<!--
a:link {color: #0000ff;}
 a:visited {color: #3563a8;} 
 a:active {color: #000000;}
 a:hover {background-color: #000000;}
 a {text-decoration: none;}
 -->
<!-- <li><a href="xyz.com">Home</a></li>
 -->
<!--1st drop down menu
-->
<!--2nd drop down menu -->
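
To illustrate what the negative lookahead and the optional whitespace in that pattern do, here is a small sketch. The find_comments helper and the sample strings are hypothetical, made up for this demonstration and not taken from the question's file:

```perl
use strict;
use warnings;

# Hypothetical helper: return every HTML comment found in $html,
# using the same pattern as the code above.
sub find_comments {
    my ($html) = @_;
    my @found = $html =~ /(<!--(?!-?>).*?--\s*>)/sg;
    return @found;
}

# Hypothetical test strings chosen to exercise each part of the pattern.
my @samples = (
    '<!-- plain -->',      # ordinary comment
    '<!-- spaced -- >',    # whitespace before the closing >
    '<!-->',               # opening immediately followed by >
    '<!--->',              # opening immediately followed by ->
);

for my $html (@samples) {
    my @found = find_comments($html);
    printf "%-18s %s\n", $html, @found ? 'matched' : 'rejected';
}
```

The first two strings should be reported as matched and the last two as rejected, because the lookahead (?!-?>) refuses an opening <!-- that is directly followed by > or ->.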
