
How to convert HTML page to plain text in node.js?

I know this has been asked before, but I can't find a good answer for node.js.

I need to extract the plain text (no tags, scripts, etc.) on the server side from an HTML page that has been fetched.

I know how to do it client-side with jQuery (get the .text() contents of the body tag), but I do not know how to do this on the server side.

I've tried https://npmjs.org/package/html-to-text but this doesn't handle scripts.

var htmlToText = require('html-to-text');
var request = require('request');

// `url` holds the address of the page to fetch.
request.get(url, function (error, result) {
    if (error) throw error;
    var text = htmlToText.fromString(result.body, {
        wordwrap: 130
    });
});

I've tried phantom.js but can't find a way to just get plain text.

Use jsdom and jQuery (server-side).

With jQuery you can delete all scripts, styles, templates and the like, and then you can extract the text.

Example

(This is not tested with jsdom and node, only in Chrome.)

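// Remove elements whose contents are not visible text.
jQuery('script, style, noscript').remove();
// Collapse each run of two or more whitespace characters into one space.
jQuery('body').text().replace(/\s{2,}/g, ' ');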
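
The whitespace cleanup above only rewrites runs of two or more whitespace characters, so a lone newline or tab between words is left in place. A minimal sketch of a stricter normalization follows, in plain JavaScript with no DOM required. `normalizeWhitespace` is a hypothetical helper name, not part of jQuery or jsdom.

```javascript
// Hypothetical helper: collapse every whitespace run (spaces, tabs, newlines)
// into a single space and trim both ends.
const normalizeWhitespace = (text) => text.replace(/\s+/g, ' ').trim();

console.log(normalizeWhitespace('  Hello\n\tworld,   plain\ntext  '));
// prints "Hello world, plain text"
```

This flattens the text onto one line; if paragraph breaks matter, the original two-or-more pattern may be the better fit.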

For those searching for a regex solution, here is mine:

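const HTMLPartToTextPart = (HTMLPart) => (
  HTMLPart
    // Drop source line breaks; only block-level tags produce newlines.
    .replace(/\n/g, '')
    // Remove elements whose contents are not visible text.
    .replace(/<style[^>]*>[\s\S]*?<\/style[^>]*>/ig, '')
    .replace(/<head[^>]*>[\s\S]*?<\/head[^>]*>/ig, '')
    .replace(/<script[^>]*>[\s\S]*?<\/script[^>]*>/ig, '')
    // Turn paragraph/div ends and <br> into newlines, then strip remaining tags.
    .replace(/<\/\s*(?:p|div)>/ig, '\n')
    .replace(/<br[^>]*\/?>/ig, '\n')
    .replace(/<[^>]*>/g, '')
    // A string pattern would replace only the first match, so use a global regex.
    .replace(/&nbsp;/ig, ' ')
    // Collapse runs of spaces and tabs, keeping newlines.
    .replace(/[^\S\r\n][^\S\r\n]+/g, ' ')
);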
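
The regex chain above handles only the `&nbsp;` entity, so references such as `&amp;` or `&#39;` survive into the output. A minimal sketch of a decoding step that could run on the stripped text is shown below. `decodeEntities` and `NAMED_ENTITIES` are hypothetical names, the table covers only a handful of named entities, and mapping `nbsp` to a regular space is an assumption suited to plain-text output.

```javascript
// Hypothetical, deliberately small table; HTML defines many more named entities.
const NAMED_ENTITIES = { amp: '&', lt: '<', gt: '>', quot: '"', apos: "'", nbsp: ' ' };

// Decode numeric (&#169;, &#x41;) and a few named character references in one pass.
const decodeEntities = (text) =>
  text.replace(/&(#x[0-9a-f]+|#\d+|[a-z]+);/gi, (match, body) => {
    if (body[0] === '#') {
      const isHex = body[1] === 'x' || body[1] === 'X';
      const code = parseInt(body.slice(isHex ? 2 : 1), isHex ? 16 : 10);
      // Leave out-of-range references untouched instead of throwing.
      return code <= 0x10ffff ? String.fromCodePoint(code) : match;
    }
    // Unknown names are left as they are.
    return Object.prototype.hasOwnProperty.call(NAMED_ENTITIES, body)
      ? NAMED_ENTITIES[body]
      : match;
  });

console.log(decodeEntities('Tom &amp; Jerry &#169; 5 &lt; 6 &#x41;'));
// prints "Tom & Jerry © 5 < 6 A"
```

Decoding should happen after the tags are stripped; otherwise escaped markup such as `&lt;b&gt;` would be turned into a tag and removed along with real ones.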

You can use TextVersionJS ( http://textversionjs.com ) to generate the plain text version of an HTML string. It's pure JavaScript (with tons of RegExps), so you can use it in the browser and in node.js as well.

This library may work for your needs, but it's NOT the same as getting the text of an element in the browser. Its purpose is to create a text version of an HTML email. This means that things like images are included. For example, given the following HTML and code snippet:

var textVersion = require("textversionjs");
var htmlText = "<html>" +
                    "<body>" +
                        "Lorem ipsum <a href=\"http://foo.foo\">dolor</a> sic <strong>amet</strong><br />" +
                        "Lorem ipsum <img src=\"http://foo.jpg\" alt=\"foo\" /> sic <pre>amet</pre>" +
                        "<p>Lorem ipsum dolor <br /> sic amet</p>" +
                        "<script>" +
                            "alert(\"nothing\");" +
                        "</script>" +
                    "</body>" +
                "</html>";
var plainText = textVersion.htmlToPlainText(htmlText);

The variable plainText will contain this string:

Lorem ipsum [dolor] (http://foo.foo) sic amet
Lorem ipsum ![foo] (http://foo.jpg) sic amet
Lorem ipsum dolor
sic amet

Note that it does properly ignore script tags. You'll find the latest version of the source code on GitHub.

As another answer suggested, use JSDOM, but you don't need jQuery. Try this:

const { JSDOM } = require('jsdom');
JSDOM.fragment(sourceHtml).textContent

Why not just get textContent of the body tag?

var body = document.getElementsByTagName('body')[0];
var bodyText = body.textContent;
