How to extract text from the following html as I want by JavaScript / cheerio?

I want to extract text from the following HTML, saved as test.html:

<div class="trans-container">
  <ul>
     <p class="wordGroup">
        <span style="font-weight: bold; color: #959595; margin-right: .5em; width : 36px; display: inline-block;">adj.</span>
        <span class="contentTitle"><a class="search-js" href="/w/good/#keyfrom=E2Ctranslation">good</a>
        <span style="font-weight: bold; color: #959595;"> ;</span>
        </span>
        <span class="contentTitle"><a class="search-js" href="/w/fine/#keyfrom=E2Ctranslation">fine</a>
        <span style="font-weight: bold; color: #959595;"> ;</span>
        </span>
        <span class="contentTitle"><a class="search-js" href="/w/ok/#keyfrom=E2Ctranslation">ok</a>
        </span>
     </p>
     <p class="wordGroup">
        <span style="font-weight: bold; color: #959595; margin-right: .5em; width : 36px; display: inline-block;">adv.</span>
        <span class="contentTitle"><a class="search-js" href="/w/well/#keyfrom=E2Ctranslation">well</a>
        </span>
     </p>
     <p class="wordGroup">
        <span style="font-weight: bold; color: #959595; margin-right: .5em; width : 36px; display: inline-block;">misc.</span>
        <span class="contentTitle"><a class="search-js" href="/w/all right/#keyfrom=E2Ctranslation">all right</a>
        </span>
     </p>
  </ul>
</div>

and print it in the following format:

adj. good ; fine ; ok
adv. well
misc. all right

What I've tried is the code below

const cheerio = require('cheerio');
const fs = require('fs');

const $ = cheerio.load(fs.readFileSync('./test.html'));
$('div.trans-container').find('p.wordGroup').each(function(i,elm){
  const line = []
  $(this).find('span').each(function(i,elm){
    line[i] = $(this).text().trim()
  })
  console.log(line.join(' '))
});

Unfortunately, the output is as shown below, which is not what I want. Can anyone point out where I went wrong? I would also appreciate other good ways to solve this in JavaScript, with or without Cheerio.

adj. good
         ; ; fine
         ; ; ok
adv. well
misc. all right

This is perhaps the solution you are looking for. Replace the assignment inside the inner loop with:

line[i] = $(this).children().length > 0 ? $(this).children(":first-child").text().trim() : $(this).text().trim();

This gives the expected output. It checks whether the current span has child elements; if it does, it takes the text of the first child only, and otherwise it takes the span's own text.

The official jQuery documentation for text() at http://api.jquery.com/text/ says:

Get the combined text contents of each element in the set of matched elements, including their descendants, or set the text contents of the matched elements.

Another relevant post: https://stackoverflow.com/a/32170000/578855

If you give an id to each of your <p> tags, you can use this script to access their child elements and read their values:

var adjElements = document.getElementById("adj").children;
var advElements = document.getElementById("adv").children;
var miscElements = document.getElementById("misc").children;
var adjObject =[];
var advObject =[];
var miscObject =[];


for (var i=0; i<=adjElements.length -1; i++){
    adjObject.push(adjElements[i].innerText);
}

for (var i=0; i<=advElements.length -1; i++){
    advObject.push(advElements[i].innerText);
}

for (var i=0; i<=miscElements.length -1; i++){
    miscObject.push(miscElements[i].innerText);
}

console.log(adjObject); //["adj.", "good ; ", "fine ; ", "ok"]
console.log(advObject); //["adv.", "well"]
console.log(miscObject); //  ["misc.", "all right"]

I made an example for you:

https://jsfiddle.net/37g6ture/2/

Remember to add the adj, adv and misc ids to your <p> tags.

Your primary issue is the double loop. The inner one, $(this).find('span').each, matches nested spans as well as top-level ones, so some text is collected twice. For example:

<span class="contentTitle">
    <a class="search-js" href="/w/fine/#keyfrom=E2Ctranslation">fine</a>
    <span style="font-weight: bold; color: #959595;"> ;</span>
</span>

Calling .text() on <span class="contentTitle"> returns fine ; and then the inner <span style="font-weight: bold; color: #959595;"> is visited as well, adding a second ; . Secondly, if your goal is to collapse each run of whitespace into a single space, this would work: .replace(/\s\s+/g, ' ')

The whole code:

const $ = require('cheerio').load(require('fs').readFileSync('./test.html'));
$('div.trans-container').find('p.wordGroup').each(function(i,elm){
  // Collapse whitespace runs, then trim the single space left at each end.
  console.log($(this).text().replace(/\s\s+/g, ' ').trim());
});

which results in

adj. good ; fine ; ok
adv. well
misc. all right
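
The double visit described above can be reproduced without cheerio. The following is only a toy model in plain JavaScript: the node shape and the helpers `text` and `findAll` are made up for illustration and merely mimic how `.text()` and `.find('span')` behave on nested elements.

```javascript
// Toy DOM node: { tag, children }, where children are nodes or plain strings.
const inner = { tag: 'span', children: [' ;'] };
const outer = { tag: 'span', children: [{ tag: 'a', children: ['fine'] }, inner] };
const group = { tag: 'p', children: [{ tag: 'span', children: ['adj.'] }, outer] };

// Mimics .text(): concatenates the text of a node and all of its descendants.
function text(node) {
  return typeof node === 'string' ? node : node.children.map(text).join('');
}

// Mimics .find(tag): returns every descendant with that tag, at any depth.
function findAll(node, tag) {
  const hits = [];
  for (const child of node.children) {
    if (typeof child === 'string') continue;
    if (child.tag === tag) hits.push(child);
    hits.push(...findAll(child, tag));
  }
  return hits;
}

// The outer span already contributes "fine ;", and the inner span adds ";" again.
const parts = findAll(group, 'span').map((n) => text(n).trim());
console.log(parts.join(' ')); // adj. fine ; ;
```

Calling `text(group)` once on the whole paragraph avoids the duplication, which is the idea behind the cheerio code above.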

Just use text() on the main group, .wordGroup in this case; it returns all the text inside the element without the HTML tags. Then run replace() on it to turn each run of whitespace into a single space.

$('div.trans-container').find('p.wordGroup').each(function(i,elm){
  // regex: /\s+/g matches 1 or more whitespace characters \n\r\f\t
  var line = $(this).text().replace(/\s+/g," ");
  console.log(line);
});
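
The whitespace handling can be tried on its own with a plain string. This is a minimal sketch: `normalizeWhitespace` is a hypothetical helper name, and `raw` is a hand-written approximation of what text() returns for the first paragraph, not captured output. It also adds a trim(), since the text starts and ends with whitespace that would otherwise survive as a single space.

```javascript
// Hypothetical helper: collapse every whitespace run, then strip both ends.
function normalizeWhitespace(raw) {
  return raw.replace(/\s+/g, ' ').trim();
}

// Approximation of the text inside the first <p class="wordGroup">.
const raw = '\n        adj.\n        good\n         ;\n        \n        fine\n         ;\n        \n        ok\n        \n     ';
console.log(normalizeWhitespace(raw)); // adj. good ; fine ; ok

// /\s\s+/g only matches runs of two or more, so a lone newline is left alone.
const sample = 'good\n;';
console.log(JSON.stringify(sample.replace(/\s\s+/g, ' '))); // "good\n;"
console.log(JSON.stringify(sample.replace(/\s+/g, ' ')));   // "good ;"
```

For this particular HTML both patterns give the same result, because every gap between words that should be joined is either a single space or a longer run.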

As for doing it with plain JavaScript: Node.js has no built-in DOM, so there you need a module such as cheerio or jsdom. If you mean JavaScript in the browser, it would look like this:

document.querySelectorAll('div.trans-container p.wordGroup')
  .forEach(ele=>console.log( ele.innerText.replace(/\s+/g," ") ));
