
Grep and Extract Data in Perl

I have HTML content stored in a variable. How do I extract data that is found between a set of common tags in the page? For example, I am interested in the data (represented by DATA) kept between a set of tags that appear one line after the other:

...
<td class="jumlah">*DATA_1*</td>
<td class="ud"><a href="">*DATA_2*</a></td>
...

I would then like to store the mapping DATA_2 => DATA_1 in a hash.

Since it is HTML, I think this could work for you:

https://metacpan.org/pod/XML::XPath

XPath is the way.

Use an HTML parsing module, as described in the answers to this question: HTML::TreeBuilder or HTML::Parser.

Purely theoretically you could try to do this with regular expressions, but as noted in the linked question's answers and countless other times on SO, parsing HTML with RegEx is a Bad Idea with capital letters: too easy to get wrong, too hard to get right, and impossible to get 100% right since HTML is not a regular language.
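The module-based approach can be sketched with plain HTML::TreeBuilder and its look_down method. This is a minimal sketch, not a definitive implementation: the sample markup in $content, the values in it, and the %map hash name are assumptions made for illustration, and it assumes each td of class "jumlah" is followed by a sibling td of class "ud".

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical sample input standing in for the real page content.
my $content = <<'HTML';
<table>
<tr>
<td class="jumlah">10</td>
<td class="ud"><a href="">alpha</a></td>
</tr>
<tr>
<td class="jumlah">20</td>
<td class="ud"><a href="">beta</a></td>
</tr>
</table>
HTML

my $tree = HTML::TreeBuilder->new_from_content($content);

my %map;    # DATA_2 => DATA_1
for my $value_cell ( $tree->look_down( _tag => 'td', class => 'jumlah' ) ) {
    # right() in list context returns every following sibling;
    # keep the first one that is an element <td class="ud">.
    my ($key_cell) = grep {
        ref $_ && $_->tag eq 'td' && ( $_->attr('class') // '' ) eq 'ud'
    } $value_cell->right;
    next unless $key_cell;

    $map{ $key_cell->as_trimmed_text } = $value_cell->as_trimmed_text;
}

$tree->delete;    # release the parse tree

print "$_ => $map{$_}\n" for sort keys %map;
```

as_trimmed_text returns the text of the cell and everything nested inside it, so the link text inside the "ud" cell becomes the hash key without any extra handling of the a element.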

You might try this module: HTML::TreeBuilder::XPath. The documentation says:

This module adds typical XPath methods to HTML::TreeBuilder, to make it easy to query a document.

Since it's HTML, you probably want the XPath module made for working with HTML, HTML::TreeBuilder::XPath.

First you'll need to parse your string using the HTML::TreeBuilder methods. Assuming your webpage's content is in a variable named $content, do it like this:

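use HTML::TreeBuilder::XPath;

my $tree = HTML::TreeBuilder::XPath->new;  # the XPath subclass provides findnodes
$tree->parse_content($content);            # parse the string held in $content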

Now you can use XPath expressions to get the nodes you care about. This first expression gets all td nodes that are in a tr in a table in the body of the html element:

my $tdNodes = $tree->findnodes('/html/body/table/tr/td');

Finally, you can iterate over all the nodes in a loop to find what you want:

foreach my $node ($tdNodes->get_nodelist) {
  my $data = $node->findvalue('.');  # the text content of the node
  print "$data\n";
}

See the HTML::TreeBuilder documentation for more on its methods, and the NodeSet documentation for how to use the NodeSet result object. w3schools has a passable XPath tutorial.

With all this, you should be able to do pretty robust HTML parsing to grab any element you want. You can even specify classes, ids, and more in your XPath queries to be really specific about which nodes you want. In my opinion, parsing HTML using this modified XPath library is a lot faster and more maintainable than dealing with a bunch of one-off regexes.
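Selecting by class can be sketched as follows, building the DATA_2 => DATA_1 hash the question asks for. This is a sketch under stated assumptions rather than a definitive implementation: the sample markup in $content, its values, and the %map name are hypothetical, and it assumes each "jumlah" cell shares a tr with its matching "ud" cell.

```perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Hypothetical sample input standing in for the real page content.
my $content = <<'HTML';
<table>
<tr>
<td class="jumlah">10</td>
<td class="ud"><a href="">alpha</a></td>
</tr>
<tr>
<td class="jumlah">20</td>
<td class="ud"><a href="">beta</a></td>
</tr>
</table>
HTML

my $tree = HTML::TreeBuilder::XPath->new;
$tree->parse_content($content);

my %map;    # DATA_2 => DATA_1
# Visit only the rows that contain a <td class="jumlah"> cell.
for my $row ( $tree->findnodes('//tr[td[@class="jumlah"]]') ) {
    my $value = $row->findvalue('./td[@class="jumlah"]');
    my $key   = $row->findvalue('./td[@class="ud"]/a');
    next unless length $key;    # skip rows without the expected link
    $map{$key} = $value;
}

$tree->delete;    # release the parse tree

print "$_ => $map{$_}\n" for sort keys %map;
```

The predicate [@class="jumlah"] matches only when the class attribute is exactly that string; a cell carrying several classes would need a contains()-style test instead.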


