Extracting specific nodes from HTML with Nokogiri

Question

I want to extract a few values from HTML using Nokogiri in this Ruby script:

#!/usr/bin/ruby
require 'nokogiri'

doc = Nokogiri::HTML(<<-END_OF_HTML)
  <html>
  <meta content="text/html; charset=UTF-8"/>
  <body style='margin:20px'>
    <p>The following user has registered a device, click on the link below to review the user and make any changes if necessary.</p>
    <ul style='list-style-type:none; margin:25px 15px;'>
      <li><b>User name:</b> Test User</li>
      <li><b>User email:</b> test@abc.com</li>
      <li><b>Identifier:</b> abc123def132afd1213afas</li>
      <li><b>Description:</b> Tom's iPad</li>
      <li><b>Model:</b> iPad 3</li>
      <li><b>Platform:</b> </li>
      <li><b>App:</b> Test app name</li>
      <li><b>UserID:</b> </li>
     </ul>
    <p>Review user: https://cirrus.app47.com/users?search=test@abc.com</p>            <hr style='height=2px; color:#aaa'/>
        <p>We hope you enjoy the app store experience!</p>
        <p style='font-size:18px; color:#999'>Powered by App47</p>
      <img src='https://cirrus.app47.com/notifications/562506219ac25b1033000904/img' alt=''/></body></html>
END_OF_HTML

Specifically, I want to get the values of some of the list items, such as "Identifier:" and "User name:", and store them in strings.

I'm fairly sure I need to use XPath, but that's about all I know. My understanding is that XPath does node selection.

What do I need to specify with XPath, and how do I then get the selection into some variables?

Full Solution

Ultimately, I was really asking two questions.

Question 1 (implicit): How can I see the results of a search using `xpath`?

doc.xpath("SPECIFY_SEARCH_HERE").each do |node|
  puts node
end

This works because xpath returns a Nokogiri::XML::NodeSet, an array-like collection that you can iterate over, and you can then do what you want with each result (in my case, print it).

Question 2: How do I get the value of a particular list item?

str = doc.xpath("//ul/li[contains(b, 'Identifier')]/text()").to_s.strip

My analysis of this line is limited, but it looks like it does the following:

1. Find the li elements that are children of a ul: //ul/li
2. Keep only the li whose bold child (b) contains 'Identifier': [contains(b, 'Identifier')]
3. Extract the text of the selection from #2: /text()
4. .to_s.strip converts the selection to a string and removes leading/trailing whitespace

For anyone better versed in HTML/Ruby/XPath, feel free to update the explanation for precision.

Answer 1

This will return both values you asked for:

//ul/li[contains(b, 'Identifier') or contains(b, 'User name')]/text()

Of course, you can modify the XPath expression to get only one value at a time:

//ul/li[contains(b, 'Identifier')]/text()

Solution 1
2 Accepted 2015-10-19 20:04:50
