Parsing data from the JavaScript of a retrieved page

Question

I'm retrieving a web page with OpenURI:

require 'open-uri'
# Kernel#open no longer accepts URLs as of Ruby 3.0; URI.open is the supported form.
page = URI.open('http://www.example.com').read.scrub

Now I'd like to parse the values of the attributes playerurl, playerdata and pageurl from the retrieved page. They appear in a <script> tag:

<script>
..
..
  PlayerWatchdog.init({
      'playerurl': 'http://cdn.static.de/now/player.swf?ts=2011354353',
      'playerdata': 'http://www.example.com/player',
      'pageurl': 'http://www.example.com?test=2',
      });
..
..
</script>

What's the smartest way to accomplish this?

Answer 1

You can use an HTML parser, such as Nokogiri, to take apart the HTML document and quickly find the <script> tag you're after. The content inside a <script> tag is text, so Nokogiri's text method will return it. Then it's a matter of selectively retrieving the lines you want, which can be done with a simple regular expression:

require 'nokogiri'

doc = Nokogiri::HTML(<<EOT)
<html>
  <head>
    <script>
      PlayerWatchdog.init({
          'playerurl': 'http://cdn.static.de/now/player.swf?ts=2011354353',
          'playerdata': 'http://www.example.com/player',
          'pageurl': 'http://www.example.com?test=2',
          });
    </script>
  </head>
</html>
EOT

script_text = doc.at('script').text
# The closing quote sits outside the capture group so it is not part of the result.
playerurl, playerdata, pageurl = %w[
  playerurl
  playerdata
  pageurl
].map{ |i| script_text[/'#{ i }': '([^']+)'/, 1] }

playerurl # => "http://cdn.static.de/now/player.swf?ts=2011354353"
playerdata # => "http://www.example.com/player"
pageurl # => "http://www.example.com?test=2"

at returns the first matching <script> Node instance. Depending on the HTML, you might not want the first matching <script>. You can use search instead, which returns a NodeSet, similar to an array of Nodes, and then grab a particular element from the NodeSet. Alternatively, instead of a CSS selector you can use XPath, which lets you easily specify a particular occurrence of the desired tag.

Once the tag is found, text returns its contents, and the task moves from Nokogiri to using a pattern to find what is desired. /'#{ i }': '([^']+)'/ is a simple pattern that looks for a word, passed in as i, followed by : ' and then captures everything up to the next '. That pattern is passed to String's [] method, along with the index of the capture group to return.
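To make the String#[] step concrete, here is a minimal sketch that runs without Nokogiri. The script text is copied from the question. The settings hash and the nosuchkey lookup are illustrative additions, not part of the original answer.

```ruby
# Sample script text, copied from the question.
script_text = <<~JS
  PlayerWatchdog.init({
      'playerurl': 'http://cdn.static.de/now/player.swf?ts=2011354353',
      'playerdata': 'http://www.example.com/player',
      'pageurl': 'http://www.example.com?test=2',
      });
JS

# String#[] with a regexp and a capture index returns only that group,
# or nil when the pattern does not match.
playerurl = script_text[/'playerurl': '([^']+)'/, 1]
missing   = script_text[/'nosuchkey': '([^']+)'/, 1]

# scan collects every 'key': 'value' pair; with two groups each match
# becomes a [key, value] array, which to_h turns into a hash.
settings = script_text.scan(/'(\w+)':\s*'([^']*)'/).to_h
```

The scan variant may be convenient when the set of keys is not known in advance, at the cost of also picking up any other quoted key/value pairs in the same script.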

Answer 2

Ruby has no built-in JavaScript parsing capabilities. You can use a regexp, though this will be rather sensitive to the formatting of the page (for example, it will break if the page starts using double quotes for strings):

playerurl = page[/'playerurl':\s*'([^']*)'/, 1]
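If the quote style is a concern, one possible hedge is a pattern that accepts either quote character and uses backreferences to require a matching closing quote. This is only a sketch: js_string_value is a hypothetical helper name, the sample strings are invented, and the pattern still assumes the values contain no escaped quotes.

```ruby
# Hypothetical helper: return the string value assigned to +key+ in
# JavaScript source, accepting single or double quotes for key and value.
def js_string_value(source, key)
  # Group 1: quote around the key; group 2: quote around the value;
  # group 3: the value itself. \1 and \2 require matching closing quotes.
  pattern = /(['"])#{Regexp.escape(key)}\1\s*:\s*(['"])(.*?)\2/
  source[pattern, 3]
end

# Invented sample lines in both quoting styles.
single = "'playerurl': 'http://www.example.com/a.swf',"
double = %q("playerurl":"http://www.example.com/b.swf",)

from_single = js_string_value(single, 'playerurl')
from_double = js_string_value(double, 'playerurl')
not_found   = js_string_value(single, 'pageurl')
```

This remains a pattern match rather than real JavaScript parsing, so it can still break on other formatting changes, such as unquoted keys or values built by concatenation.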

Answer 1
3 2014-11-03 18:17:24

Answer 2
1 accepted 2014-11-03 17:31:27
