
Is there a way to get the raw HTML from Nokogiri?

I've seen "How to get the raw HTML source code for a page by using Ruby or Nokogiri?", which uses something like this:

file = open("index.html")
puts file.read
page = Nokogiri::HTML(file)

But it seems to move the read position to the end of the file, so Nokogiri can't read the file anymore. If I swap the read and the Nokogiri call:

file = open("index.html")
page = Nokogiri::HTML(file)
puts file.read

The file is no longer output. I'd like to be able to query Nokogiri for the HTML it used originally, so that I can do my own extra parsing on the raw source. Ideally, I'd like something like:

file = open("index.html")
page = Nokogiri::HTML(file)
raw_html = page.html

Note: I've also tried page.to_html, but it seems to change the formatting slightly.

You usually pass a File instance so it can be processed in chunks, but passing a string is also OK:

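require 'nokogiri'

html = File.read("index.html")  # raw source, read once and kept as a String
page = Nokogiri::HTML(html)     # parse from the String instead of the File
raw_html = html                 # there is no page.html; reuse the String you read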
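The read-position behaviour described in the question can be reproduced without Nokogiri at all. The following is a minimal sketch using only the standard library; the helper `read_twice_then_rewind`, the constant `SAMPLE_HTML`, and the temporary file standing in for index.html are all hypothetical names introduced for illustration.

```ruby
require 'tempfile'

# Hypothetical sample markup standing in for the contents of index.html.
SAMPLE_HTML = "<html><body><p>foo</p></body></html>\n"

# Hypothetical helper: reads an IO twice, rewinds it, then reads it again.
# Returns the three results so they can be compared.
def read_twice_then_rewind(io)
  first  = io.read  # consumes everything up to the end of the file
  second = io.read  # already at the end, so nothing is left to read
  io.rewind         # move the read position back to the start
  third  = io.read  # the full contents are available again
  [first, second, third]
end

# Write the sample to a temporary file, rewind, and run the helper on it.
results = Tempfile.create(['index', '.html']) do |file|
  file.write(SAMPLE_HTML)
  file.rewind
  read_twice_then_rewind(file)
end

first, second, third = results
puts "second read empty? #{second.empty?}"
puts "read after rewind matches first read? #{first == third}"
```

If you would rather keep passing the File to Nokogiri, calling rewind on the handle between the two consumers should have the same effect as reading the file into a String up front, at the cost of reading it twice.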

Just as an FYI: you can also ask Nokogiri to return the HTML (or XML, if that's what you're working with) of the document after Nokogiri has parsed it, or after modifications:

doc = Nokogiri::HTML('<head><body>foo</body></head>')
puts doc.to_html

In pry, this outputs:

[4] (pry) main: 0> puts doc.to_html
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html>
<head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8"></head>
<body>foo</body>
</html>

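Notice that Nokogiri did some fixups to make the HTML "more-better": it added a DOCTYPE, wrapped everything in an html element, inserted a meta tag, and moved the body out of the head. That is why to_html won't give you the original source back.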


 