Extract specific nodes in HTML using Nokogiri
I want to extract a few values from HTML using Nokogiri in this Ruby script:
#!/usr/bin/ruby
require 'nokogiri'
doc = Nokogiri::HTML(<<-END_OF_HTML)
<html>
<meta content="text/html; charset=UTF-8"/>
<body style='margin:20px'>
<p>The following user has registered a device, click on the link below to review the user and make any changes if necessary.</p>
<ul style='list-style-type:none; margin:25px 15px;'>
<li><b>User name:</b> Test User</li>
<li><b>User email:</b> test@abc.com</li>
<li><b>Identifier:</b> abc123def132afd1213afas</li>
<li><b>Description:</b> Tom's iPad</li>
<li><b>Model:</b> iPad 3</li>
<li><b>Platform:</b> </li>
<li><b>App:</b> Test app name</li>
<li><b>UserID:</b> </li>
</ul>
<p>Review user: https://cirrus.app47.com/users?search=test@abc.com</p> <hr style='height=2px; color:#aaa'/>
<p>We hope you enjoy the app store experience!</p>
<p style='font-size:18px; color:#999'>Powered by App47</p>
<img src='https://cirrus.app47.com/notifications/562506219ac25b1033000904/img' alt=''/></body></html>
END_OF_HTML
Specifically, I want to get the values of some of the list members, like "Identifier:" and "User name:", and store them in strings.
I'm sure I need to use xpath, but that's about it. My understanding is that xpath does node selection.
What do I need to specify with xpath, and then how do I get the selection into some variables?
Ultimately I was really asking two questions.
Question 1 (implicit): How do I view the results of a search using xpath?
doc.xpath("SPECIFY_SEARCH_HERE").each do |node|
  puts node
end
This works because xpath returns a NodeSet (Nokogiri::XML::NodeSet), an array-like collection that you can iterate over, so you can do what you want with each result (in my case, print it).
Question 2: What do I need to specify with xpath to get a value into a variable?
str = doc.xpath("//ul/li[contains(b, 'Identifier')]/text()").to_s.strip
My analysis of this line is limited, but it looks like it does this:
1. Finds the li children of every ul using: //ul/li
2. Keeps only the li whose bold child (b) contains 'Identifier': [contains(b, 'Identifier')]
3. Extracts the text value of the selection from #2: /text()
4. .to_s.strip converts the selection to a string and removes leading/trailing whitespace
For anyone better versed in HTML/Ruby/XPath, feel free to update the explanation for precision.
This will return both values you asked for:
//ul/li[contains(b, 'Identifier') or contains(b, 'User name')]/text()
Of course, you can modify the XPath to get only one value at a time:
//ul/li[contains(b, 'Identifier')]/text()