Cheerio - get text with HTML tags replaced by spaces

Question

Today we use Cheerio, and in particular its .text() method, to extract text from HTML input.

But when the HTML is

<div>
  By<div><h2 class="authorh2">John Smith</h2></div>
</div>

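Visually, on the page, the nested <div> after the word "By" ensures that there is a space or a line break. But when we apply Cheerio's text(), we get a result that is wrong:

ByJohn Smith => this is wrong, because we need a space between "By" and "John".

In general, is it possible to get the text in a special way so that any HTML tag is replaced by a space? (I can trim all the multiple spaces afterwards...)

We would like to get "By John Smith" as the output.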

Answer 1

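In general, is it possible to get the text in a special way so that any HTML tag is replaced by a space? (I can trim all the multiple spaces afterwards...)

Just add ' ' before and after every element: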

$("*").each(function (index) {
    $(this).prepend(' ');
    $(this).append(' ');
});

Then collapse the repeated spaces:

$.text().replace(/\s{2,}/g, ' ').trim();
//=> "By John Smith"

Since Cheerio is essentially a jQuery implementation for Node.js, you may find answers written for jQuery useful as well.

Working example:

const cheerio = require('cheerio');
const $ = cheerio.load(`
    <div>
        By<div><h2 class="authorh2">John Smith</h2></div>
    </div>
`);

$("*").each(function (index) {
    $(this).prepend(' ');
    $(this).append(' ');
});

let raw = $.text();
//=> "        By  John Smith" (duplicate spaces)

let trimmed = raw.replace(/\s{2,}/g, ' ').trim();
//=> "By John Smith"
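
One detail of the cleanup step: \s{2,} only matches runs of two or more whitespace characters, so a lone newline or tab between two words is left in place. Here is a minimal sketch of a stricter normalizer. The name normalizeWhitespace is hypothetical and is not part of Cheerio.

```javascript
// Hypothetical helper (not a Cheerio API): collapse every run of whitespace,
// including a single newline or tab, into one space and trim both ends.
function normalizeWhitespace(text) {
    return text.replace(/\s+/g, ' ').trim();
}

const sample = 'By\nJohn\tSmith';

// \s{2,} needs at least two whitespace characters in a row, so nothing changes.
console.log(JSON.stringify(sample.replace(/\s{2,}/g, ' ').trim()));
//=> "By\nJohn\tSmith"

console.log(JSON.stringify(normalizeWhitespace(sample)));
//=> "By John Smith"
```

Whether single newlines should become spaces depends on what the text is used for. For a one-line author string like the one in the question, collapsing everything is probably what you want.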

Answer 2

You can use the following regular expression to replace all HTML tags with spaces:

/<\/?[a-zA-Z0-9=" ]*>/g

Replacing the tags in your HTML with this regex may leave several spaces in a row. In that case, you can use replace(/\s\s+/g, ' ') to replace each run of whitespace with a single space.

See the result:

 console.log(document.querySelector('div').innerHTML.replaceAll(/<\/?[a-zA-Z0-9=" ]*>/g, ' ').replace(/\s\s+/g, ' ').trim())

 <div> By<div><h2 class="authorh2">John Smith</h2></div> </div>
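
The character class in the regex above only allows letters, digits, =, double quotes and spaces inside a tag, so tags whose attributes contain other characters (hyphens, single quotes, slashes, dots) are not matched. Below is a minimal sketch with a looser pattern. It assumes simple markup with no > inside attribute values and no comments, script or style blocks. The name tagsToSpaces is hypothetical.

```javascript
// Hypothetical helper: replace anything that looks like a tag with a space,
// then collapse whitespace. Assumes no '>' inside attribute values and no
// <script>/<style>/comment content; use a real HTML parser for those cases.
function tagsToSpaces(html) {
    return html
        .replace(/<[^>]*>/g, ' ')
        .replace(/\s+/g, ' ')
        .trim();
}

const html = `<div>
  By<div><h2 class="authorh2">John Smith</h2></div>
</div>`;
console.log(tagsToSpaces(html)); // By John Smith

// A tag the narrower pattern misses because of '-' and the single quotes:
const tricky = "By<span data-role='author'>John Smith</span>";
console.log(tricky.replace(/<\/?[a-zA-Z0-9=" ]*>/g, ' ')); // the opening tag is left in place
console.log(tagsToSpaces(tricky)); // By John Smith
```

Note that this sketch does not decode entities such as &amp;, which a parser-based approach would handle.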

Answer 3

You can do this with plain JavaScript (in a browser, where innerText is available).

 const parent = document.querySelector('div');
 console.log(parent.innerText.replace(/(\r\n|\n|\r)/gm, " "))

 <div> By<div><h2 class="authorh2">John Smith</h2></div> </div>

Answer 4

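Instead of Cheerio, you can use htmlparser2. It lets you define callback methods that are invoked each time an opening tag, text, or a closing tag is encountered while parsing the HTML.

This code produces the output string you want: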

const htmlparser = require('htmlparser2');

let markup = `<div>
By<div><h2 class="authorh2">John Smith</h2></div>
</div>`;

var parts = [];
var parser = new htmlparser.Parser({
    onopentag: function(name, attributes){
        parts.push(' ');
    },
    ontext: function(text){
        parts.push(text);
    },
    onclosetag: function(tagName){
    // no-op
    }
}, {decodeEntities: true});

parser.write(markup);
parser.end();

// Join the parts, collapse every run of whitespace (including the
// newlines from the markup) into a single space, and trim both ends.
const result = parts.join('').replace(/\s+/g, ' ').trim();

console.log(result); // By John Smith

Here is another example of how to use it: https://runkit.com/jfahrenkrug/htmlparser2-demo/1.0.0

Answer 5

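Cheerio's text() method is mainly intended for getting clean text out of scraped pages. As you have already experienced, that is a bit different from converting an HTML page to plain text. If you only need the text for indexing, a regex replacement that adds spaces will work. For some other scenarios, such as converting to audio, it will not always work, because you need to distinguish between spaces and new lines.

My suggestion is to use a library that converts HTML to Markdown. One option is Turndown.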

var TurndownService = require('turndown')

var turndownService = new TurndownService()
var markdown = turndownService.turndown('<div>\nBy<div><h2>John Smith</h2></div></div>')

This will print:

'By\n\nJohn Smith\n----------'

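The last line is there because of the H2 heading. Markdown is much easier to clean up; you will probably only need to remove URLs and images. The resulting text is also easier for humans to read.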
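
That cleanup could look roughly like the following. This is a sketch in plain JavaScript that only handles inline images, inline links and heading underlines. The name markdownToText is a hypothetical helper and is not part of Turndown.

```javascript
// Hypothetical helper (not a Turndown API): reduce simple Markdown to plain text.
// Handles only inline images, inline links, and underline rows (=== / ---).
function markdownToText(markdown) {
    return markdown
        .replace(/!\[[^\]]*\]\([^)]*\)/g, '')          // drop images entirely
        .replace(/\[([^\]]*)\]\([^)]*\)/g, '$1')       // keep link text, drop the URL
        .replace(/^[ \t]*(?:=+|-{2,})[ \t]*$/gm, '')   // drop heading underlines
        .replace(/[ \t]+/g, ' ')                       // collapse spaces within a line
        .replace(/\n{2,}/g, '\n')                      // collapse blank lines
        .trim();
}

console.log(markdownToText('By\n\nJohn Smith\n----------'));
// By
// John Smith
```

The line break between "By" and "John Smith" is preserved, which is the distinction between spaces and new lines mentioned above.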

Answer 6

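If you want a clean text representation of the content, I suggest using lynx (used by Project Gutenberg) or pandoc. Both can be installed and then invoked from Node using spawn. They give a cleaner text representation than running Puppeteer and using textContent or innerText.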
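
Here is a minimal sketch of calling such a tool from Node, assuming a pandoc binary is installed and on the PATH. It uses spawnSync from Node's child_process module with pandoc's -f html -t plain options. The name htmlToPlainText is a hypothetical wrapper, and it returns null when pandoc cannot be run.

```javascript
const { spawnSync } = require('child_process');

// Hypothetical wrapper: pipe HTML to pandoc on stdin and read plain text from stdout.
// Assumes a `pandoc` binary on the PATH; returns null if it is missing or fails.
function htmlToPlainText(html) {
    const result = spawnSync('pandoc', ['-f', 'html', '-t', 'plain'], {
        input: html,
        encoding: 'utf8',
    });
    if (result.error || result.status !== 0) {
        return null;
    }
    return result.stdout;
}

const text = htmlToPlainText('<div>\nBy<div><h2>John Smith</h2></div></div>');
console.log(text === null ? 'pandoc is not available' : text);
```

The synchronous call keeps the sketch short. For many documents or a server process, the asynchronous spawn mentioned above is likely the better fit.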
