Grep and Extract Data in Perl
I have HTML content stored in a variable. How do I extract data that is found between a set of common tags in the page? For example, I am interested in the data (represented by DATA) kept between a set of tags that appear one line after the other:
...
<td class="jumlah">*DATA_1*</td>
<td class="ud"><a href="">*DATA_2*</a></td>
...
And then I would like to store a mapping DATA_2 => DATA_1 in a hash.
Since it is HTML, I think this could work for you:
https://metacpan.org/pod/XML::XPath
XPath is the way.
Use an HTML parsing module, as described in the answers to the linked question: HTML::TreeBuilder or HTML::Parser.
Purely theoretically, you could try doing this with regular expressions, but as noted in the linked question's answers and countless other times on SO, parsing HTML with RegEx is a Bad Idea with capital letters: too easy to get wrong, too hard to do well, and impossible to get 100% right since HTML is not a regular language.
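To make the "too easy to get wrong" point concrete, here is a small sketch using only core Perl. The two sample cells and the pattern are made up for illustration; they are not taken from the asker's page. Both cells mean the same thing to a browser, but a pattern written against one spelling silently misses the other.

```perl
use strict;
use warnings;

# Two hypothetical cells that a browser treats identically;
# only the quoting and spacing of the markup differ.
my @cells = (
    q{<td class="jumlah">42</td>},
    q{<td  class='jumlah' >42</td>},
);

# A typical one-off pattern, written against the first spelling only.
my $pattern = qr{<td class="jumlah">(.*?)</td>};

my @results;
for my $cell (@cells) {
    # In list context a failed match returns an empty list, so $value stays undef.
    my ($value) = $cell =~ $pattern;
    push @results, $value;
    printf "%-32s => %s\n", $cell, defined $value ? $value : '(no match)';
}
```

You can keep widening the pattern to cover each new variation, but a parser handles these cases for you.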
You might try this module: HTML::TreeBuilder::XPath. The doc says:
This module adds typical XPath methods to HTML::TreeBuilder, to make it easy to query a document.
Since it's HTML, you probably want the XPath module made for working with HTML, HTML::TreeBuilder::XPath.
First you'll need to parse your string using the HTML::TreeBuilder methods. Assuming your webpage's content is in a variable named $content, do it like this:
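use HTML::TreeBuilder::XPath;
my $tree = HTML::TreeBuilder::XPath->new;   # the XPath-enabled subclass provides findnodes
$tree->parse($content);                     # parse the HTML string held in $content
$tree->eof;                                 # signal the end of the input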
Now you can use XPath expressions to get the set of nodes you care about. This first expression gets all td nodes that are in a tr in a table in the body in the html element:
my $tdNodes = $tree->findnodes('/html/body/table/tr/td');
Finally you can just iterate over all the nodes in a loop to find what you want:
foreach my $node ($tdNodes->get_nodelist) {
    my $data = $node->findvalue('.');   # the text content of the node
    print "$data\n";
}
See the HTML::TreeBuilder documentation for more on its methods and the NodeSet documentation for how to use the NodeSet result object. w3schools has a passable XPath tutorial.
With all this, you should be able to do pretty robust HTML parsing to grab out any element you want. You can even specify classes, ids, and more in your XPath queries to be really specific about which nodes you want. In my opinion, parsing HTML using this modified XPath library is a lot faster and more maintainable than dealing with a bunch of one-off regexes.
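Putting the pieces together for the original question, here is a minimal sketch of building the DATA_2 => DATA_1 hash with class-specific XPath queries. It assumes each pair of cells sits in its own tr, which the question does not actually show; the sample HTML and the %data hash name are made up for illustration.

```perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Hypothetical sample input; in the real script this is the fetched page.
my $content = <<'HTML';
<table>
  <tr>
    <td class="jumlah">10</td>
    <td class="ud"><a href="">alpha</a></td>
  </tr>
  <tr>
    <td class="jumlah">20</td>
    <td class="ud"><a href="">beta</a></td>
  </tr>
</table>
HTML

my $tree = HTML::TreeBuilder::XPath->new;
$tree->parse($content);
$tree->eof;

my %data;
# In list context findnodes returns a plain list of matching elements.
for my $row ($tree->findnodes('//tr')) {
    # Paths starting with "./" are evaluated relative to the current row.
    my $value = $row->findvalue('./td[@class="jumlah"]');   # DATA_1
    my $key   = $row->findvalue('./td[@class="ud"]/a');     # DATA_2
    next unless length $key;    # skip rows without the cells we want
    $data{$key} = $value;
}

$tree->delete;    # free the parse tree once the values are copied out

print "$_ => $data{$_}\n" for sort keys %data;
```

If the cells on your page are not grouped one pair per row, the row-based loop will need adjusting, for example by starting from each td with class "ud" and looking at its preceding sibling.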