Parse data from JavaScript of retrieved page
I'm retrieving a web page with OpenURI:
require 'open-uri'
# URI.open is required on Ruby 3.0+, where Kernel#open no longer opens URLs
page = URI.open('http://www.example.com').read.scrub
Now I'd like to parse the values of the attributes playerurl, playerdata and pageurl of the retrieved page. They appear in a <script> tag:
<script>
..
..
PlayerWatchdog.init({
'playerurl': 'http://cdn.static.de/now/player.swf?ts=2011354353',
'playerdata': 'http://www.example.com/player',
'pageurl': 'http://www.example.com?test=2',
});
..
..
</script>
What's the smartest way to accomplish this?
You can use an HTML parser, such as Nokogiri, to take apart the HTML document and quickly find the <script> tag you're after. The content inside a <script> tag is text, so Nokogiri's text method will return it. Then it's a matter of selectively retrieving the lines you want, which can be done with a simple regular expression:
require 'nokogiri'
doc = Nokogiri::HTML(<<EOT)
<html>
<head>
<script>
PlayerWatchdog.init({
'playerurl': 'http://cdn.static.de/now/player.swf?ts=2011354353',
'playerdata': 'http://www.example.com/player',
'pageurl': 'http://www.example.com?test=2',
});
</script>
</head>
</html>
EOT
script_text = doc.at('script').text
playerurl, playerdata, pageurl = %w[
playerurl
playerdata
pageurl
].map { |i| script_text[/'#{ i }': '([^']+)'/, 1] } # closing quote sits outside the capture group
playerurl # => "http://cdn.static.de/now/player.swf?ts=2011354353"
playerdata # => "http://www.example.com/player"
pageurl # => "http://www.example.com?test=2"
at returns the first matching <script> Node instance. Depending on the HTML, you might not want the first matching <script>. You can use search instead, which returns a NodeSet, similar to an array of Nodes, and then grab a particular element from the NodeSet. Alternatively, instead of a CSS selector, you can use XPath, which lets you easily specify a particular occurrence of the desired tag.
Once the tag is found, text returns its contents, and the task moves from Nokogiri to using a pattern to find what is desired. The regular expression in the map block is a simple pattern that looks for the word passed in as i, wrapped in single quotes and followed by : ', and then captures everything up to the next '. That pattern is passed to String's [] method.
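To make the String#[] step concrete, here is a small sketch on a shortened, made-up one-line version of the script text. The key nosuchkey is hypothetical and only there to show the no-match case; Regexp.escape is an extra precaution for interpolated keys and is not part of the answer above:

```ruby
# Assumed sample text, shortened from the question's <script> content.
script_text = "PlayerWatchdog.init({ 'pageurl': 'http://www.example.com?test=2' });"

# str[regex, 1] returns capture group 1 of the first match.
pageurl = script_text[/'pageurl': '([^']+)'/, 1]

# The same lookup with the key interpolated, as in the map block.
key = 'pageurl'
interpolated = script_text[/'#{Regexp.escape(key)}': '([^']+)'/, 1]

# When nothing matches, String#[] returns nil instead of raising.
missing = script_text[/'nosuchkey': '([^']+)'/, 1]
```

Because a missing key yields nil, it is worth checking the result before using it as a URL.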
Ruby has no built-in JavaScript parsing capabilities. You can use a regexp, though it will be rather sensitive to the formatting of the page (for example, it will break if the page starts using double quotes for strings):
playerurl = page[/'playerurl':\s*'([^']*)'/, 1]
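If the quoting style is a concern, one possible way to loosen the pattern is to accept either quote character and require the closing quote to match the opening one. This is only a sketch: the sample input with mixed quoting is an assumption, and extract_value is a hypothetical helper, not part of any library. It still won't handle escaped quotes inside values or other JavaScript syntax.

```ruby
# Assumed sample input: the question's data, but with mixed quoting and spacing.
page = <<~JS
  PlayerWatchdog.init({
    "playerurl": "http://cdn.static.de/now/player.swf?ts=2011354353",
    'playerdata':'http://www.example.com/player',
    'pageurl' : 'http://www.example.com?test=2',
  });
JS

# Hypothetical helper: returns the value for key, or nil if it is not found.
def extract_value(text, key)
  # Groups 1 and 2 capture the opening quotes; \1 and \2 demand the same
  # character as the closing quote. Group 3 is the value itself.
  pattern = /(['"])#{Regexp.escape(key)}\1\s*:\s*(['"])(.*?)\2/
  text[pattern, 3]
end

values = %w[playerurl playerdata pageurl].to_h do |key|
  [key, extract_value(page, key)]
end
```

values then maps each attribute name to its URL, with nil for any key the pattern could not find.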