
Getting the text of a webpage with Javascript

BeautifulSoup, a Python library, has a function called get_text() that can take a parsed HTML page, such as this: https://pastebin.com/DJwA3S5P

and extract all of the text from it, thus turning it into this: https://pastebin.com/qMqrj8RS

Here's another example of what the function can do:

If given the following:

<span id="sm_flash_225" onclick="sm_flash_process('bail', this,1)" onmouseover="sm_flash_add('bail', this, 1);" onmouseout="sm_flash_remove('bail', this, 1);">bail</span> 

BeautifulSoup's get_text() function will simply turn it into: bail

In other words, it takes <span id ="some_id" more random stuff...>text</span> and turns it into text.

I have the HTML of a website stored as one large formatted string. I would like to write the Javascript equivalent of BeautifulSoup's get_text() in order to get only the text of the webpage. I'm fine with using any third-party library; I don't want to reinvent the wheel. However, it's worth noting that I'm writing this in the context of a Chrome/Firefox web extension, so I don't believe I can use every third-party library.

I acquired the HTML with the following code:

fetch(url)
.then((resp) => resp.text())
.then(function (data) { 
    //get the text of the webpage by 
    //mimicking Beautiful Soup's get_text() function        
})

Try this:

fetch("test.html")
  .then(data => data.text())
  .then(text => {
    let div = document.createElement("div");
    div.innerHTML = text;
    console.log(div.textContent);
  });

It is safer not to insert live HTML (and JS) from another site into your own page. Use DOMParser instead:

fetch("https://cors-anywhere.herokuapp.com/stackoverflow.com")
  .then(response => response.text())
  .then(responseText => {
    const responseDocument = (new DOMParser()).parseFromString(responseText, 'text/html');
    console.log(responseDocument.head.textContent);
    console.log(responseDocument.body.textContent);
  });
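Building on the DOMParser approach, here is a rough sketch of a get_text()-style helper. Note that textContent also returns the source text inside script and style elements, so this sketch removes those elements before reading the text and then tidies up the whitespace. The names getText and collapseWhitespace are made up for illustration and are not part of any library, and the sketch assumes it runs somewhere DOMParser is available (a normal page, content script, or extension page). It is not meant as an exact reproduction of BeautifulSoup's output.

```javascript
// Hypothetical helper: collapse runs of whitespace on each line
// and drop lines that end up empty.
function collapseWhitespace(text) {
  return text
    .split(/\r?\n/)
    .map(line => line.replace(/\s+/g, " ").trim())
    .filter(line => line.length > 0)
    .join("\n");
}

// Hypothetical helper: parse an HTML string into a detached document
// and return only its visible-ish text.
// Assumes DOMParser exists in the current context (it does not in plain Node.js).
function getText(html) {
  const doc = new DOMParser().parseFromString(html, "text/html");
  // Remove elements whose text content is code or styling, not page text.
  doc.querySelectorAll("script, style, noscript").forEach(el => el.remove());
  return collapseWhitespace(doc.body.textContent);
}
```

With the fetch call from the question, this could be used as `.then(data => console.log(getText(data)))`. Because the document created by DOMParser is detached from the page, scripts in the fetched HTML are not executed.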
