Performant parsing of HTML pages with Node.js and XPath

I'm into some web scraping with Node.js. I'd like to use XPath, as I can generate it semi-automatically with several sorts of GUI. The problem is that I cannot find a way to do this effectively.

  1. jsdom is extremely slow. It parses a 500 KiB file in a minute or so, with full CPU load and a heavy memory footprint.
  2. Popular libraries for HTML parsing (e.g. cheerio) neither support XPath nor expose a W3C-compliant DOM.
  3. Effective HTML parsing is, obviously, implemented in WebKit, so using phantom or casper would be an option, but those need to be run in a special way, not just node <script>. I cannot rely on the risk implied by this change. For example, it's much more difficult to find out how to run node-inspector with phantom.
  4. Spooky is an option, but it's buggy enough that it didn't run at all on my machine.

What's the right way to parse an HTML page with XPath then?

You can do so in several steps.

  1. Parse the HTML with parse5. The bad part is that the result is not a DOM, though it's fast enough and W3C-compliant.
  2. Serialize it to XHTML with xmlserializer, which accepts the DOM-like structures of parse5 as input.
  3. Parse that XHTML again with xmldom. Now you finally have that DOM.
  4. The xpath library builds upon xmldom, allowing you to run XPath queries. Be aware that XHTML has its own namespace, and queries like //a won't work.

Finally you get something like this.

const fs = require('mz/fs');
const xpath = require('xpath');
const parse5 = require('parse5');
const xmlser = require('xmlserializer');
const dom = require('xmldom').DOMParser;

(async () => {
    // Read the page and parse it with parse5 (fast and spec-compliant,
    // but the result is not a W3C DOM)
    const html = await fs.readFile('./test.htm');
    const document = parse5.parse(html.toString());
    // Serialize the parse5 tree to XHTML...
    const xhtml = xmlser.serializeToString(document);
    // ...and re-parse it with xmldom to finally get a real DOM
    const doc = new dom().parseFromString(xhtml);
    // XHTML is namespaced, so bind the x: prefix for queries
    const select = xpath.useNamespaces({"x": "http://www.w3.org/1999/xhtml"});
    const nodes = select("//x:a/@href", doc);
    console.log(nodes);
})();

Note that you have to prepend every single HTML element of a query with the x: prefix; for example, to match an a inside a div you would need:

//x:div/x:a
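
As a quick sanity check, here is how that prefixed query could be run with the select function from the snippet above (the div/a structure is just an illustration):

// Element names need the x: prefix; attribute names do not
const anchors = select("//x:div/x:a/@href", doc);
anchors.forEach(attr => console.log(attr.value));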

Libxmljs is currently the fastest implementation (something like a benchmark), since it is only bindings to the LibXML C library, which supports XPath 1.0 queries:

var libxmljs = require("libxmljs");

// `xml` must already be well-formed XML, not raw HTML (see below)
var xmlDoc = libxmljs.parseXml(xml);
// xpath queries
var gchild = xmlDoc.get('//grandchild');

However, you need to sanitize your HTML first and convert it to proper XML. For that you could either use the HTMLTidy command-line utility (tidy -q -asxml input.html), or, if you want to keep it Node-only, something like xmlserializer should do the trick.
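
A minimal Node-only sketch of that conversion, assuming the parse5 and xmlserializer packages from the answer above, might look like this:

const libxmljs = require('libxmljs');
const parse5 = require('parse5');
const xmlser = require('xmlserializer');

const html = '<p>Hello <a href="#">world</a></p>';
// Turn forgiving HTML into well-formed XHTML first...
const xhtml = xmlser.serializeToString(parse5.parse(html));
// ...then hand it to libxmljs for fast XPath 1.0 queries
const doc = libxmljs.parseXml(xhtml);
// XHTML is namespaced, so bind a prefix for the query
const links = doc.find('//x:a', { x: 'http://www.w3.org/1999/xhtml' });
console.log(links.length);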

I have just started using htmlstrip-native (npm install htmlstrip-native), which uses a native implementation to parse and extract the relevant HTML parts. It claims to be 50 times faster than the pure JS implementation (I have not verified that claim).

Depending on your needs you can use html-strip directly, or lift the code and bindings to make your own use of the C++ used internally in htmlstrip-native.

If you want to use XPath, then use the wrapper already available here: https://www.npmjs.org/package/xpath
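
For completeness, a rough sketch of stripping tags with it; the html_strip export and its option names here are taken from my reading of the package README, so treat them as assumptions:

var htmlstrip = require('htmlstrip-native');

// Option names as documented in the package README (unverified)
var text = htmlstrip.html_strip('<p>Hello <b>world</b></p>', {
    include_script: false,   // drop <script> contents
    include_style: false,    // drop <style> contents
    compact_whitespace: true // collapse runs of whitespace
});
console.log(text);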

With just one line, you can do it with xpath-html:

const xpath = require("xpath-html");

const node = xpath.fromPageSource(html).findElement("//*[text()='Made with love by']");

I think Osmosis is what you're looking for.

  • Uses native libxml C bindings
  • Supports CSS 3.0 and XPath 1.0 selector hybrids
  • Sizzle selectors, Slick selectors, and more
  • No large dependencies like jQuery, cheerio, or jsdom
  • HTML parser features

    • Fast parsing
    • Very fast searching
    • Small memory footprint
  • HTML DOM features

    • Load and search ajax content
    • DOM interaction and events
    • Execute embedded and remote scripts
    • Execute code in the DOM

Here's an example:

const osmosis = require('osmosis');

let count = 0;
osmosis.get(url)                      // `url` is the page to scrape
    .find('//div[@class]/ul[2]/li')   // XPath selector
    .then(function () {
        count++;                      // runs once per matched node
    })
    .done(function () {
        console.log(count + ' nodes matched');
    });

There might never be a right way to parse HTML pages. A very first review of web scraping and crawling shows me that Scrapy can be a good candidate for your need. It accepts both CSS and XPath selectors. In the realm of Node.js, we have a pretty new module, node-osmosis. This module is built upon libxmljs, so it is supposed to support both CSS and XPath, although I did not find any example using XPath.
