How can I get the raw HTML source of a page using Ruby or Nokogiri?

Question

I'm using Nokogiri (a Ruby HTML/XML parsing library with XPath support) to extract content from web pages. I ran into problems with some pages, such as Ajax-driven ones: when I view the page source, I don't see the content I'm after, such as a <table>.

How can I get the HTML code for the actual content?

Answer 1

Don't use Nokogiri at all if you want the raw source of a web page. Just fetch the page directly as a string, and don't feed it to Nokogiri. For example:

require 'open-uri'
# Use URI.open: since Ruby 3.0, Kernel#open no longer opens URLs
html = URI.open('http://phrogz.net').read
puts html.length #=> 8461
puts html        #=> ...raw source of the page...
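
A quick way to tell whether the content you want is in the raw source at all is to search the fetched string for the tag. The following is a minimal sketch; the helper name in_raw_source? is hypothetical and not part of any library, and the regular expression is only a rough check, not a parser.

```ruby
# Hypothetical helper: reports whether an opening tag such as <table>
# appears anywhere in a raw HTML string (case-insensitive).
def in_raw_source?(html, tag)
  # Match "<tag" followed by whitespace, ">" or "/" so that "table"
  # does not accidentally match a longer tag name like "tablet".
  html.match?(/<#{Regexp.escape(tag)}[\s>\/]/i)
end

puts in_raw_source?('<html><TABLE class="data"></TABLE></html>', 'table')
puts in_raw_source?('<div id="app"></div><script src="app.js"></script>', 'table')
```

If this returns false for the string fetched with open-uri while the browser does show the element, the element is most likely being added by JavaScript after the page loads.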

If, on the other hand, you want the contents of a page after JavaScript has modified it (for example, a page where an AJAX library runs JavaScript to fetch new content and change the page), then Nokogiri can't help: it parses HTML but does not execute JavaScript. You need to use Ruby to control a web browser (e.g. read up on Selenium or Watir).
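
As an illustration of the browser-driven approach, here is a minimal sketch using the selenium-webdriver gem. It assumes the gem and a local Chrome installation are available; the method name rendered_html is hypothetical. The method takes the driver as an argument, so it only relies on the driver responding to get, page_source and quit.

```ruby
# Hypothetical helper: returns the HTML of the page as the browser
# currently sees it, i.e. after its scripts have had a chance to run.
# `driver` is expected to behave like a Selenium::WebDriver driver.
def rendered_html(url, driver)
  driver.get(url)      # navigate to the page
  driver.page_source   # HTML serialized from the browser's current DOM
ensure
  driver.quit          # always close the browser session
end

# Usage (assumes the selenium-webdriver gem and Chrome are installed):
#   require 'selenium-webdriver'
#   driver = Selenium::WebDriver.for(:chrome)
#   puts rendered_html('http://phrogz.net', driver)
```

Content that arrives via AJAX some time after the initial load may not be present yet when page_source is read. In that case you would likely need to wait for the element first, for instance with Selenium::WebDriver::Wait, before reading the source.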

Answer 1
6, accepted, 2012-06-06 19:55:46
