Get the text of a webpage using JavaScript

Question

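The Python library BeautifulSoup has a function called get_text() that takes a parsed HTML page, such as: https://pastebin.com/DJwA3S5P

and extracts all the text from it, turning it into: https://pastebin.com/qMqrj8RS

Here is another example of what the function does:

Given the following: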

<span id="sm_flash_225" onclick="sm_flash_process('bail', this,1)" onmouseover="sm_flash_add('bail', this, 1);" onmouseout="sm_flash_remove('bail', this, 1);">bail</span>

BeautifulSoup's get_text() function would simply convert it to: bail

In other words, it takes <span id="some_id" more random stuff...>text</span> and turns it into text.

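I have the HTML of a website stored in one large string. I want to write the JavaScript equivalent of BeautifulSoup's get_text() so that I get only the text of the webpage. I can use any third-party libraries, as I would rather not reinvent the wheel. It is worth noting, however, that I am writing this in the context of a Chrome/Firefox web extension, so I don't believe I can use just any third-party library.

I fetched the HTML file with the following code: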

fetch(url)
.then((resp) => resp.text())
.then(function (data) { 
    //get the text of the webpage by 
    //mimicking Beautiful Soup's get_text() function        
})

Answer 1

Try this:

fetch("test.html")
  .then(data => data.text())
  .then(text => {
    let div = document.createElement("div");
    div.innerHTML = text;
    console.log(div.textContent);
  });

Answer 2

It would be safer not to insert another site's live HTML (and JS) into your own page. Use DOMParser instead:

fetch("https://cors-anywhere.herokuapp.com/stackoverflow.com")
  .then(response => response.text())
  .then(responseText => {
    const responseDocument = (new DOMParser()).parseFromString(responseText, 'text/html');
    console.log(responseDocument.head.textContent);
    console.log(responseDocument.body.textContent);
  });
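One caveat with textContent: it also returns the source inside <script> and <style> elements and keeps all of the original whitespace, so the result may be noisier than you want. Below is a minimal sketch that builds on the DOMParser approach, assuming a browser or extension page where DOMParser is available. The names htmlToText and normalizeWhitespace are hypothetical helpers for illustration, not part of any library.

```javascript
// Hypothetical helper: collapse every run of whitespace into one space and trim the ends.
function normalizeWhitespace(text) {
  return text.replace(/\s+/g, " ").trim();
}

// Hypothetical helper: parse an HTML string into an inert document and return its text.
// `parser` defaults to the browser's DOMParser; it is a parameter only so that
// another object with a compatible parseFromString method can be supplied.
function htmlToText(html, parser = new DOMParser()) {
  const doc = parser.parseFromString(html, "text/html");
  // textContent would include the source of these elements, so remove them first.
  doc.querySelectorAll("script, style").forEach((el) => el.remove());
  // Guard against documents without a <body> (for example, frameset pages).
  return normalizeWhitespace(doc.body ? doc.body.textContent : "");
}
```

In the question's fetch chain this could be called as .then((data) => console.log(htmlToText(data))). Collapsing whitespace to single spaces is a design choice; return doc.body.textContent directly if line breaks matter.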

2 answers

Answer 1: 1, 2018-04-01 05:16:41
Answer 2: 1, accepted, 2018-04-01 05:30:57
