How to extract text from the following HTML with JavaScript / cheerio?

I want to extract text from the following HTML, which is saved in a file named test.html:

<div class="trans-container">
  <ul>
     <p class="wordGroup">
        <span style="font-weight: bold; color: #959595; margin-right: .5em; width : 36px; display: inline-block;">adj.</span>
        <span class="contentTitle"><a class="search-js" href="/w/good/#keyfrom=E2Ctranslation">good</a>
        <span style="font-weight: bold; color: #959595;"> ;</span>
        </span>
        <span class="contentTitle"><a class="search-js" href="/w/fine/#keyfrom=E2Ctranslation">fine</a>
        <span style="font-weight: bold; color: #959595;"> ;</span>
        </span>
        <span class="contentTitle"><a class="search-js" href="/w/ok/#keyfrom=E2Ctranslation">ok</a>
        </span>
     </p>
     <p class="wordGroup">
        <span style="font-weight: bold; color: #959595; margin-right: .5em; width : 36px; display: inline-block;">adv.</span>
        <span class="contentTitle"><a class="search-js" href="/w/well/#keyfrom=E2Ctranslation">well</a>
        </span>
     </p>
     <p class="wordGroup">
        <span style="font-weight: bold; color: #959595; margin-right: .5em; width : 36px; display: inline-block;">misc.</span>
        <span class="contentTitle"><a class="search-js" href="/w/all right/#keyfrom=E2Ctranslation">all right</a>
        </span>
     </p>
  </ul>
</div>

and print it in the following format:

adj. good ; fine ; ok
adv. well
misc. all right

Here is the code I have tried:

const cheerio = require('cheerio');
const fs = require('fs');

const $ = cheerio.load(fs.readFileSync('./test.html'));
$('div.trans-container').find('p.wordGroup').each(function(i,elm){
  const line = []
  $(this).find('span').each(function(i,elm){
    line[i] = $(this).text().trim()
  })
  console.log(line.join(' '))
});

Unfortunately, the output is as shown below, which is not exactly what I want. Can anyone point out where I went wrong? I would also greatly appreciate any other decent ways to solve this problem in JavaScript, with or without Cheerio.

adj. good
         ; ; fine
         ; ; ok
adv. well
misc. all right

This is perhaps the solution you are looking for. Replace the assignment inside the inner loop with:

line[i] = $(this).children().length > 0 ? $(this).children(":first-child").text().trim() : $(this).text().trim();

This gives the expected output. It checks whether the current node has child elements. If it does, only the text of the first child is taken; if it does not, the node's own text is extracted.

The official jQuery documentation for the text() function at http://api.jquery.com/text/ says:

Get the combined text contents of each element in the set of matched elements, including their descendants, or set the text contents of the matched elements.

Another relevant post is https://stackoverflow.com/a/32170000/578855
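To see why "including their descendants" matters here, the following is a minimal sketch that does not use cheerio at all. It models one contentTitle span from the question as a plain object tree; the node shape and the helper names combinedText and collect are hypothetical and exist only for illustration. Visiting every span and taking its combined text yields the semicolon twice, once through the outer span and once through the inner one.

```javascript
// Hypothetical model of one <span class="contentTitle"> from the question.
// A node is either a string (a text node) or an object { tag, children }.
const contentTitle = {
  tag: 'span',
  children: [
    { tag: 'a', children: ['fine'] },
    '\n        ',
    { tag: 'span', children: [' ;'] },
    '\n        ',
  ],
};

// Combined text of a node and all of its descendants,
// mirroring the documented behaviour of text().
function combinedText(node) {
  if (typeof node === 'string') return node;
  return node.children.map(combinedText).join('');
}

// Collect the node itself and every descendant with the given tag,
// in document order.
function collect(node, tag, out = []) {
  if (typeof node === 'string') return out;
  if (node.tag === tag) out.push(node);
  node.children.forEach((child) => collect(child, tag, out));
  return out;
}

// Both the outer and the inner span are visited, so ";" shows up twice.
const spanTexts = collect(contentTitle, 'span').map((n) => combinedText(n).trim());
console.log(spanTexts);
```

The first entry still contains the line break and indentation between "fine" and ";", which is where the broken lines in the question's output come from.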

If you give an id to each of your <p> tags, you can use this script to access their child elements and read their values:

var adjElements = document.getElementById("adj").children;
var advElements = document.getElementById("adv").children;
var miscElements = document.getElementById("misc").children;
var adjObject =[];
var advObject =[];
var miscObject =[];


for (var i=0; i<=adjElements.length -1; i++){
    adjObject.push(adjElements[i].innerText);
}

for (var i=0; i<=advElements.length -1; i++){
    advObject.push(advElements[i].innerText);
}

for (var i=0; i<=miscElements.length -1; i++){
    miscObject.push(miscElements[i].innerText);
}

console.log(adjObject); //["adj.", "good ; ", "fine ; ", "ok"]
console.log(advObject); //["adv.", "well"]
console.log(miscObject); //  ["misc.", "all right"]

I made an example for you:

https://jsfiddle.net/37g6ture/2/

Remember to add the adj, adv and misc IDs to your p tags.
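The arrays logged above still need to be joined to match the format requested in the question. A minimal sketch of that last step, using the array contents from the comments above as sample data (the helper name formatGroup is hypothetical):

```javascript
// Sample data copied from the console.log comments in the script above.
const groups = [
  ['adj.', 'good ; ', 'fine ; ', 'ok'],
  ['adv.', 'well'],
  ['misc.', 'all right'],
];

// Trim each part, then join the parts with a single space.
function formatGroup(parts) {
  return parts.map((part) => part.trim()).join(' ');
}

const lines = groups.map(formatGroup);
lines.forEach((line) => console.log(line));
// adj. good ; fine ; ok
// adv. well
// misc. all right
```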

Your primary issue is the double loop. The inner one, $(this).find('span').each, visits nested spans as well as their parents, so some text is collected twice. For example:

<span class="contentTitle">
    <a class="search-js" href="/w/fine/#keyfrom=E2Ctranslation">fine</a>
    <span style="font-weight: bold; color: #959595;"> ;</span>
</span>

Calling text() on <span class="contentTitle"> returns "fine ;". The inner span <span style="font-weight: bold; color: #959595;"> is then iterated over as well, adding a second ";". Secondly, if your goal is to collapse all extra whitespace into a single space, this would work: .replace(/\s\s+/g, ' ')

The whole code:

const $ = require('cheerio').load(require('fs').readFileSync('./test.html'));
$('div.trans-container').find('p.wordGroup').each(function(i,elm){
  console.log($(this).text().replace(/\s\s+/g, ' '));
});

which results in

adj. good ; fine ; ok 
adv. well 
misc. all right 
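The printed lines keep a trailing space, and depending on how the markup is indented there may be a leading one too. A minimal sketch of adding trim() after the replacement is shown below. The rawText string is an assumption: it is hand-written to approximate what text() would return for the first p.wordGroup, and normalizeLine is a hypothetical helper name.

```javascript
// Assumed approximation of the text() result for the first <p class="wordGroup">.
const rawText =
  '\n        adj.\n        good\n         ;\n        \n        fine' +
  '\n         ;\n        \n        ok\n        \n     ';

// Collapse runs of two or more whitespace characters, as in the code above.
const collapsed = rawText.replace(/\s\s+/g, ' ');

// Same replacement, followed by trim() to drop leading and trailing spaces.
function normalizeLine(text) {
  return text.replace(/\s\s+/g, ' ').trim();
}

console.log(JSON.stringify(collapsed));              // " adj. good ; fine ; ok "
console.log(JSON.stringify(normalizeLine(rawText))); // "adj. good ; fine ; ok"
```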

Just use text() on the main group, .wordGroup in this case. It returns all the text of the element without the HTML tags. Then run replace() on the result to replace each run of whitespace characters with a single space.

$('div.trans-container').find('p.wordGroup').each(function(i,elm){
  // regex: /\s+/g matches 1 or more whitespace characters \n\r\f\t
  var line = $(this).text().replace(/\s+/g," ");
  console.log(line);
});
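The pattern /\s+/g used here differs slightly from /\s\s+/g: the former also replaces a lone tab or line break, while the latter only touches runs of two or more whitespace characters. A small sketch of the difference, using a made-up sample string:

```javascript
// Made-up sample: a single tab between the words and one trailing newline.
const sample = 'adv.\twell\n';

// Runs of two or more whitespace characters only: nothing matches here.
const twoOrMore = sample.replace(/\s\s+/g, ' ');

// Any run of one or more whitespace characters: the tab and the newline both become spaces.
const oneOrMore = sample.replace(/\s+/g, ' ');

console.log(JSON.stringify(twoOrMore)); // "adv.\twell\n"
console.log(JSON.stringify(oneOrMore)); // "adv. well "
```

For the HTML in the question, every gap between words is a line break followed by indentation, so both patterns give the same result there.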

As for doing it with just native JavaScript, you can't do that in Node.js, because it does not have native DOM support. You have to use a module like cheerio or jsdom. If you mean JavaScript in the browser, it would look like this:

document.querySelectorAll('div.trans-container p.wordGroup')
  .forEach(ele=>console.log( ele.innerText.replace(/\s+/g," ") ));
