Perl script removing words from one file to an output file
I'm pretty sure this is really basic. However, I have no knowledge of Perl and only need to use it this once, so I appreciate your patience.
I am trying to remove unwanted text from the single line of HTML below:
<a target="_blank" href="http://sharepoint/sites/cerner/quickreferenceguides/Documents/EXP001_Run_Printable_TCI_List.pdf" onmouseover="return overlib('This guide outlines the process for running a printable TCI List', CAPTION, 'TCI LIST');" onmouseout="return nd();">Run Printable TCI List (<i>Revised<i>)</a>
All I want to be left with is Run Printable TCI List (<i>Revised</i>), which is the text at the end before the </a>. I have around 500 of these lines, and since they could change in the future it makes sense to create a program. Below is my Perl code so far:
open (SEARK, 'C:\\HTMLsorter\\sources.txt');
open (OUTSEARK, '>C:\\HTMLsorter\\outseark.txt');
while(<SEARK>) {
chomp;
if ($_=~/<a target/) {
$_ =~ s/\<i>//g;
$_ =~ s/\<\/i>//g;
@itemsa = split(/>/);
@itemsb = split(/</, $itemsa[1]);
print OUTSEARK ("$itemsb[0]\n");
}
}
close (SEARK);
close (OUTSEARK);
I'm sure you can read this, but just to explain: I am opening a file called sources.txt, which contains the 500 lines to be sorted. The output file will be outseark.txt. So far it outputs this:
Run Printable TCI List (Revised)
This is obviously due to the split acting on everything in and around the angle brackets. Any ideas how I keep the italics inside the brackets? To be left with:
Run Printable TCI List (<i>Revised<i>)
Thanks for looking.
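#!/usr/bin/perl
use strict;
use warnings;

# Three-argument open with lexical filehandles; stop if a file cannot be opened.
open my $ifh, '<', 'myfile.txt' or die "Cannot open myfile.txt: $!";
open my $ofh, '>', 'output.txt' or die "Cannot open output.txt: $!";
while (<$ifh>) {
    # Capture everything between the opening <a target ...> tag and </a>.
    if (/<a\s+target.*?>(.*?)<\/a>/i) {
        $_ = $1;
        # Strips any tags left in the link text; remove this line to keep <i>...</i>.
        s/<.*?>//g;
        print $ofh "$_\n";
    }
}
close $ifh;
close $ofh;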
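The script above strips every tag from the captured text, including the italics the question wants to keep. As a minimal sketch of the same capture without that final substitution, the helper below returns the link's inner HTML untouched. The name link_inner_html and the shortened sample line are illustrative assumptions, not part of the original answer, and a regex like this only suits input as regular as these 500 lines.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the inner HTML of the first <a target=...> link
# on a line, or undef when the line has no such link.
sub link_inner_html {
    my ($line) = @_;
    # [^>]* consumes the attributes up to the end of the opening tag;
    # (.*?) lazily captures everything before the first closing </a>.
    return $line =~ m{<a\s+target[^>]*>(.*?)</a>}i ? $1 : undef;
}

# Shortened, made-up sample line with correctly closed italics.
my $sample = q{<a target="_blank" href="guide.pdf">Run Printable TCI List (<i>Revised</i>)</a>};
print link_inner_html($sample), "\n";
```

Calling it inside the while loop in place of the match-and-strip steps would write the text with its <i> tags intact.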
You could do this in a one-liner.
cat inputfile|perl -ne 'if (s#<a\s+target[^>]+>(.+?)</a>##is){print "$1\n";}'>outputfile
It works:
echo '<a target="_blank" href="http://sharepoint/sites/cerner/quickreferenceguides/Documents/EXP001_Run_Printable_TCI_List.pdf" onmouseover="return overlib('This guide outlines the process for running a printable TCI List', CAPTION, 'TCI LIST');" onmouseout="return nd();">Run Printable TCI List (<i>Revised<i>)</a>
<a target="_blank" href="http://sharepoint/sites/cerner/quickreferenceguides/Documents/EXP001_Run_Printable_TCI_List.pdf" onmouseover="return overlib('This guide outlines the process for running a printable TCI List', CAPTION, 'TCI LIST');" onmouseout="return nd();">Run Printable TCI List 1(<i>Revised<i>)</a>
<a target="_blank" href="http://sharepoint/sites/cerner/quickreferenceguides/Documents/EXP001_Run_Printable_TCI_List.pdf" onmouseover="return overlib('This guide outlines the process for running a printable TCI List', CAPTION, 'TCI LIST');" onmouseout="return nd();">Run Printable TCI List 2(<i>Revised<i>)</a>
<a target="_blank" href="http://sharepoint/sites/cerner/quickreferenceguides/Documents/EXP001_Run_Printable_TCI_List.pdf" onmouseover="return overlib('This guide outlines the process for running a printable TCI List', CAPTION, 'TCI LIST');" onmouseout="return nd();">Run Printable TCI List 3(<i>Revised<i>)</a>'|\
perl -ne 'if (s#<a\s+target[^>]+>(.+?)</a>##is){print "$1\n";}'
Run Printable TCI List (<i>Revised<i>)
Run Printable TCI List 1(<i>Revised<i>)
Run Printable TCI List 2(<i>Revised<i>)
Run Printable TCI List 3(<i>Revised<i>)
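The one-liner handles one link per input line. If the file might hold several links on a line, or a link whose tag wraps onto the next line, a sketch along these lines could read the text as a whole and collect every match. The helper name all_link_inner_html and the two sample links are assumptions made for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the inner HTML of every <a target=...> link in a
# block of text, even when several links share a line or a tag spans two lines.
sub all_link_inner_html {
    my ($html) = @_;
    # With /g in list context the match returns every capture;
    # /s lets . match newlines and /i ignores tag case.
    my @found = $html =~ m{<a\s+target[^>]+>(.+?)</a>}gis;
    return @found;
}

# Made-up input: two links on one logical line, the second tag wrapped.
my $text = qq{<a target="_blank" href="a.pdf">First (<i>Revised</i>)</a> }
         . qq{<a target="_blank"\nhref="b.pdf">Second</a>\n};
print "$_\n" for all_link_inner_html($text);
```

To apply it to a real file, the whole file would first be read into one string, for example by localising $/ to undef before reading.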
You should use a proper HTML parser, such as HTML::TreeBuilder. The code is no more complex, as this program demonstrates:
use strict;
use warnings;
use HTML::TreeBuilder;
my $tree = HTML::TreeBuilder->new_from_file(*DATA);
print $_->as_text, "\n" for $tree->look_down(_tag => 'a', target => qr/./);
__DATA__
<a target="_blank" href="http://sharepoint/sites/cerner/quickreferenceguides/Documents/EXP001_Run_Printable_TCI_List.pdf" onmouseover="return overlib('This guide outlines the process for running a printable TCI List', CAPTION, 'TCI LIST');" onmouseout="return nd();">Run Printable TCI List (<i>Revised<i>)</a>
output
Run Printable TCI List (Revised)
Edit
To use this technique on the files in your example, the code looks like this:
use strict;
use warnings;
use HTML::TreeBuilder;
my $tree = HTML::TreeBuilder->new_from_file('C:\HTMLsorter\sources.txt');
open my $out, '>', 'C:\HTMLsorter\outseark.txt' or die $!;
print $out $_->as_text, "\n" for $tree->look_down(_tag => 'a', target => qr/./);
Edit 2
Now that I understand better what you need, I can offer this alternative solution. It uses the HTML::DOM module to access the Document Object Model of an HTML document, as getting the result you needed with HTML::TreeBuilder is relatively difficult.
I've also noticed that your sample HTML contains <i>Revised<i>, which clearly should be <i>Revised</i>, and I have corrected it for this sample test. Regardless, Perl tries to parse bad HTML as a browser would, and even with the error the output is usable.
use strict;
use warnings;
use HTML::DOM;
my $dom = HTML::DOM->new;
$dom->parse_file('C:\HTMLsorter\sources.txt') or die $!;
open my $out, '>', 'C:\HTMLsorter\outseark.txt' or die $!;
print $out $_->innerHTML, "\n" for grep $_->attr('target'), $dom->getElementsByTagName('a');
output
(With tags corrected)
Run Printable TCI List (<i>Revised</i>)
(With original tags)
Run Printable TCI List (<i>Revised<i>)</i></i>
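If correcting all 500 source lines by hand is impractical, one option is to repair the unclosed tag before or after parsing. The sketch below is only that: close_italics is a hypothetical helper, and it assumes the second <i> is always a mistyped closing tag and that italic runs are never nested.

```perl
use strict;
use warnings;

# Hypothetical helper: rewrites "<i>text<i>" as "<i>text</i>".
# Assumes the second <i> is a mistyped </i> and that italics never nest.
sub close_italics {
    my ($html) = @_;
    # [^<]* stops at the next tag, so correctly closed <i>...</i> is left alone.
    $html =~ s{<i>([^<]*)<i>}{<i>$1</i>}gi;
    return $html;
}

print close_italics('Run Printable TCI List (<i>Revised<i>)'), "\n";
```

Running it over each line of sources.txt before handing the text to the parser should give the corrected-tag output shown above, provided the assumption about the typo holds for every line.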