Extract specific nodes in HTML using Nokogiri

I want to extract a few values from HTML using Nokogiri in this Ruby script:

#!/usr/bin/ruby
require 'nokogiri'

doc = Nokogiri::HTML(<<-END_OF_HTML)
  <html>
  <meta content="text/html; charset=UTF-8"/>
  <body style='margin:20px'>
    <p>The following user has registered a device, click on the link below to review the user and make any changes if necessary.</p>
    <ul style='list-style-type:none; margin:25px 15px;'>
      <li><b>User name:</b> Test User</li>
      <li><b>User email:</b> test@abc.com</li>
      <li><b>Identifier:</b> abc123def132afd1213afas</li>
      <li><b>Description:</b> Tom's iPad</li>
      <li><b>Model:</b> iPad 3</li>
      <li><b>Platform:</b> </li>
      <li><b>App:</b> Test app name</li>
      <li><b>UserID:</b> </li>
     </ul>
    <p>Review user: https://cirrus.app47.com/users?search=test@abc.com</p>            <hr style='height=2px; color:#aaa'/>
        <p>We hope you enjoy the app store experience!</p>
        <p style='font-size:18px; color:#999'>Powered by App47</p>
      <img src='https://cirrus.app47.com/notifications/562506219ac25b1033000904/img' alt=''/></body></html>
END_OF_HTML

Specifically, I want to get the values of some of the list items, such as "Identifier:" and "User name:", and store them in strings.

I'm fairly sure I need to use XPath, but that's about all I know. My understanding is that XPath does node selection.

What do I need to specify with XPath, and how do I then get the selection into variables?

Full Solution

Ultimately, I was really asking two questions.

Question 1 (implicit): How can I see the results of a search using xpath?

doc.xpath("SPECIFY_SEARCH_HERE").each do |node|
  puts node
end

This works because xpath returns a Nokogiri::XML::NodeSet, an array-like collection of the matching nodes. You can iterate over it and do what you want with each result (in my case, print it).

Question 2: How do I get the value of a particular list item?

str = doc.xpath("//ul/li[contains(b, 'Identifier')]/text()").to_s.strip

My analysis of this line is limited, but it looks like it does this:

  1. Select every li element that is a child of a ul: //ul/li
  2. Keep only the li whose bold child ( b ) contains 'Identifier': [contains(b, 'Identifier')]
  3. Select the text that sits directly inside that li, which is the value after the bold label: /text()
  4. .to_s.strip converts the selection to a string and removes leading/trailing whitespace

For anyone better versed in HTML/Ruby/XPath, feel free to update the explanation for precision.

This will return both values you asked for:

//ul/li[contains(b, 'Identifier') or contains(b, 'User name')]/text()

Of course, you can modify the XPath to get only one value at a time:

//ul/li[contains(b, 'Identifier')]/text()
