How do I parse LI/DL/DD tag structure using Ruby and Nokogiri?

I'm trying to parse HTML that contains both an ordered list and DL/DD tags. The goal is to create an XML structure that itemizes the contents of EACH tag and adds an attribute to it, in effect flattening the structure (the desired output is shown at the end of the question).

Here's an example of the HTML, stored in a file (test.html in my code):

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">  
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en" dir="ltr">  
<head>  
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />  
<title>Test Structure</title>  
</head>  
<body>  
<ol><li>Item 1 - Level 1  
<dl><dd>Item 1.1 - Level 2  
</dd><dd>Item 1.2 - Level 2  
</dd></dl>  
</li><li>Item 2 - Level 1  
<dl><dd>Item 2.1 - Level 2  
<dl><dd>Item 2.1.1 - Level 3  
</dd><dd>Item 2.1.2 - Level 3  
<dl><dd>Item 2.1.2.1 - Level 4  
</dd><dd>Item 2.1.2.2 - Level 4  
</dd></dl>  
</dd></dl>  
</dd><dd>Item 2.2 - Level 2  
<dl><dd>Item 2.2.1 - Level 3  
</dd><dd>Item 2.2.2 - Level 3  
<dl><dd>Item 2.2.2.1 - Level 4  
</dd><dd>Item 2.2.2.2 - Level 4  
</dd></dl>  
</dd><dd>Item 2.2.3 - Level 3  
<dl><dd>Item 2.2.3.1 - Level 4  
</dd><dd>Item 2.2.3.2 - Level 4  
</dd></dl>  
</dd><dd>Item 2.2.4 - Level 3  
</dd></dl>  
</dd></dl>  
</li><li>Item 3 - Level 1  
<dl><dd>Item 3.1 - Level 2  
</dd><dd>Item 3.2 - Level 2  
</dd></dl>  
</li></ol>  
</body>  
</html>

Rendered output of the HTML (shown here without the indentation you would see in a browser):

  1. Item 1 - Level 1
    Item 1.1 - Level 2
    Item 1.2 - Level 2
  2. Item 2 - Level 1
    Item 2.1 - Level 2
    Item 2.1.1 - Level 3
    Item 2.1.2 - Level 3
    Item 2.1.2.1 - Level 4
    Item 2.1.2.2 - Level 4
    Item 2.2 - Level 2
    Item 2.2.1 - Level 3
    Item 2.2.2 - Level 3
    Item 2.2.2.1 - Level 4
    Item 2.2.2.2 - Level 4
    Item 2.2.3 - Level 3
    Item 2.2.3.1 - Level 4
    Item 2.2.3.2 - Level 4
    Item 2.2.4 - Level 3
  3. Item 3 - Level 1
    Item 3.1 - Level 2
    Item 3.2 - Level 2

Desired output:

<job>  
<req level='1'>Item 1 - Level 1</req>  
<req level='1.1'>Item 1.1 - Level 2</req>  
<req level='1.2'>Item 1.2 - Level 2</req>  
<req level='2'>Item 2 - Level 1</req>  
<req level='2.1'>Item 2.1 - Level 2</req>  
<req level='2.1.1'>Item 2.1.1 - Level 3</req>  
<req level='2.1.2'>Item 2.1.2 - Level 3</req>  
<req level='2.1.2.1'>Item 2.1.2.1 - Level 4</req>  
<req level='2.1.2.2'>Item 2.1.2.2 - Level 4</req>  
<req level='2.2'>Item 2.2 - Level 2</req>  
<req level='2.2.1'>Item 2.2.1 - Level 3</req>  
<req level='2.2.2'>Item 2.2.2 - Level 3</req>  
<req level='2.2.2.1'>Item 2.2.2.1 - Level 4</req>  
<req level='2.2.2.2'>Item 2.2.2.2 - Level 4</req>  
<req level='2.2.3'>Item 2.2.3 - Level 3</req>  
<req level='2.2.3.1'>Item 2.2.3.1 - Level 4</req>  
<req level='2.2.3.2'>Item 2.2.3.2 - Level 4</req>  
<req level='2.2.4'>Item 2.2.4 - Level 3</req>  
<req level='3'>Item 3 - Level 1</req>  
<req level='3.1'>Item 3.1 - Level 2</req>  
<req level='3.2'>Item 3.2 - Level 2</req>  
</job>

Note that we want to derive the hierarchy from traversing the structure, not from the actual text content of each LI and DD element. The contents of my example spell out the hierarchy (1, 1.1, 1.2 ...), but the actual data won't contain that. The "level" attribute should reflect the traversal of the structure.

I'm new to both Ruby and Nokogiri, but here is my attempt at reading the HTML (I haven't got to creating the XML yet). I'm stuck separating out the LI nodes and their contents. I've tried using .each, children.each, etc:

require 'rubygems'  
require 'open-uri'  
require 'nokogiri'  

url = "test.html"  
doc = Nokogiri::HTML(open(url))  
line = "1"  
doc.css("ol[1]").children.each do |n|  
    puts line + n.content.to_s  
    line.succ!  
    n.children do |c|  
        puts line + c.content.to_s  
        line.succ!  
    end  
end  

You can use the node_name method to tell text nodes apart from child elements. Here's a sample function that prints the names of the HTML tags under the ol:

def traverse(node, indent = 0)
  node.children.each do |child|
    next if child.node_name == "text"
    puts "  "*indent + child.node_name
    traverse(child, indent+1)
  end
end

traverse doc.css("ol[1]")

(The text nodes that I'm skipping above are the textual content of the tags.)
