I want to extract text from the HTML below, which is saved as test.html:
<div class="trans-container">
<ul>
<p class="wordGroup">
<span style="font-weight: bold; color: #959595; margin-right: .5em; width : 36px; display: inline-block;">adj.</span>
<span class="contentTitle"><a class="search-js" href="/w/good/#keyfrom=E2Ctranslation">good</a>
<span style="font-weight: bold; color: #959595;"> ;</span>
</span>
<span class="contentTitle"><a class="search-js" href="/w/fine/#keyfrom=E2Ctranslation">fine</a>
<span style="font-weight: bold; color: #959595;"> ;</span>
</span>
<span class="contentTitle"><a class="search-js" href="/w/ok/#keyfrom=E2Ctranslation">ok</a>
</span>
</p>
<p class="wordGroup">
<span style="font-weight: bold; color: #959595; margin-right: .5em; width : 36px; display: inline-block;">adv.</span>
<span class="contentTitle"><a class="search-js" href="/w/well/#keyfrom=E2Ctranslation">well</a>
</span>
</p>
<p class="wordGroup">
<span style="font-weight: bold; color: #959595; margin-right: .5em; width : 36px; display: inline-block;">misc.</span>
<span class="contentTitle"><a class="search-js" href="/w/all right/#keyfrom=E2Ctranslation">all right</a>
</span>
</p>
</ul>
</div>
and print it in the following format:
adj. good ; fine ; ok
adv. well
misc. all right
What I've tried is the code below:
const cheerio = require('cheerio');
const fs = require('fs');
const $ = cheerio.load(fs.readFileSync('./test.html'));
$('div.trans-container').find('p.wordGroup').each(function(i,elm){
const line = []
$(this).find('span').each(function(i,elm){
line[i] = $(this).text().trim()
})
console.log(line.join(' '))
});
Unfortunately, the output is as shown below, which is not quite what I want. Can anyone point out where I went wrong? I would also greatly appreciate other decent ways to solve this problem in JavaScript, with or without Cheerio.
adj. good
; ; fine
; ; ok
adv. well
misc. all right
This is perhaps the solution you are looking for:
line[i] = $(this).children().length > 0 ? $(this).children(":first-child").text().trim() : $(this).text().trim();
This gives the expected output. It checks whether the current span has child elements. If it does, only the text of the first child is taken; if it does not, the span's own text is extracted.
The official jQuery documentation for the text() function at http://api.jquery.com/text/ says:
Get the combined text contents of each element in the set of matched elements, including their descendants, or set the text contents of the matched elements.
Another relevant post: https://stackoverflow.com/a/32170000/578855
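To make the quoted behaviour concrete, here is a minimal sketch that imitates it without any library. The mini-DOM shape and the helper names textOf and findSpans are hypothetical and exist only for illustration; they are not part of jQuery or Cheerio. It shows why collecting every descendant span and reading the text of each one yields the separator twice.

```javascript
// Hypothetical mini-DOM: a node is either a string (text node) or
// { tag, children }. This only imitates how text() gathers text.
function textOf(node) {
  if (typeof node === 'string') return node;
  // Concatenate the text of all descendants, as the docs describe.
  return node.children.map(textOf).join('');
}

// Like find('span'): collect every descendant span, at any depth.
function findSpans(node, found = []) {
  if (typeof node === 'string') return found;
  for (const child of node.children) {
    if (typeof child !== 'string' && child.tag === 'span') found.push(child);
    findSpans(child, found);
  }
  return found;
}

const group = {
  tag: 'p',
  children: [
    { tag: 'span', children: ['adj.'] },
    {
      tag: 'span', // plays the role of span.contentTitle
      children: [
        { tag: 'a', children: ['good'] },
        { tag: 'span', children: [' ;'] },
      ],
    },
  ],
};

const pieces = findSpans(group).map((span) => textOf(span).trim());
console.log(pieces); // [ 'adj.', 'good ;', ';' ]
```

The outer span already contributes "good ;", and the nested span is then visited on its own and contributes ";" again, which matches the doubled separators in the question's output.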
If you give an id to each of your <p> tags, you can use this script to access their child elements and read their values:
var adjElements = document.getElementById("adj").children;
var advElements = document.getElementById("adv").children;
var miscElements = document.getElementById("misc").children;
var adjObject =[];
var advObject =[];
var miscObject =[];
for (var i=0; i<=adjElements.length -1; i++){
adjObject.push(adjElements[i].innerText);
}
for (var i=0; i<=advElements.length -1; i++){
advObject.push(advElements[i].innerText);
}
for (var i=0; i<=miscElements.length -1; i++){
miscObject.push(miscElements[i].innerText);
}
console.log(adjObject); //["adj.", "good ; ", "fine ; ", "ok"]
console.log(advObject); //["adv.", "well"]
console.log(miscObject); // ["misc.", "all right"]
I made an example for you:
https://jsfiddle.net/37g6ture/2/
Remember to add the adj, adv and misc IDs to your p tags.
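The three loops above repeat the same step, so they could be folded into one helper. The following is only a sketch: collectTexts is a hypothetical name, and the browser usage in the comment assumes the <p> tags carry the ids adj, adv and misc as described. Plain stand-in objects are used so the helper can also be tried outside a browser.

```javascript
// Hypothetical helper: turn an HTMLCollection (or any array-like of
// objects with an innerText property) into an array of trimmed strings.
function collectTexts(children) {
  return Array.from(children, (el) => el.innerText.trim());
}

// In a browser, assuming the <p> tags carry the ids adj, adv and misc:
//   ['adj', 'adv', 'misc'].forEach((id) => {
//     const parts = collectTexts(document.getElementById(id).children);
//     console.log(parts.join(' '));
//   });

// Stand-in objects so the helper can be tried without a DOM.
const fakeChildren = [
  { innerText: 'adj.' },
  { innerText: 'good ; ' },
  { innerText: 'ok' },
];
console.log(collectTexts(fakeChildren).join(' ')); // adj. good ; ok
```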
Your primary issue is the double loop. The inner one, $(this).find('span').each, matches nested spans as well as their parents, so the text of some spans is collected twice. For example:
<span class="contentTitle">
<a class="search-js" href="/w/fine/#keyfrom=E2Ctranslation">fine</a>
<span style="font-weight: bold; color: #959595;"> ;</span>
</span>
Calling text() on <span class="contentTitle"> returns fine ; and then the inner span, <span style="font-weight: bold; color: #959595;">, is iterated over as well, adding a second ; to the line. Secondly, if your goal is to collapse runs of extra whitespace into a single space, this would work: .replace(/\s\s+/g, ' ')
The whole code:
const $ = require('cheerio').load(require('fs').readFileSync('./test.html'));
$('div.trans-container').find('p.wordGroup').each(function(i,elm){
// Collapse runs of two or more whitespace characters, then drop leading/trailing space.
console.log($(this).text().replace(/\s\s+/g, ' ').trim());
});
which results in
adj. good ; fine ; ok
adv. well
misc. all right
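One detail worth checking is how /\s\s+/g differs from /\s+/g. The sketch below uses a made-up sample string; the exact whitespace that text() returns is an assumption, since it depends on how the HTML file is indented. The first pattern only replaces runs of two or more whitespace characters, so a lone newline survives, while the second replaces every run.

```javascript
// Sample text roughly as text() might return it for one p.wordGroup.
// The exact whitespace is an assumption; it depends on the file's indentation.
const raw = '\n  adj.\n  good\n   ;\n  fine\n';

// Only runs of two or more whitespace characters are replaced.
const collapsePairs = raw.replace(/\s\s+/g, ' ');

// Every run of whitespace is replaced, then the ends are trimmed.
const collapseAll = raw.replace(/\s+/g, ' ').trim();

console.log(JSON.stringify(collapsePairs)); // " adj. good ; fine\n"
console.log(JSON.stringify(collapseAll)); // "adj. good ; fine"
```

If the file happens to have no indentation, the elements are separated by single newlines, and only the /\s+/g form puts each group on one line.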
Just use text() on the main group, .wordGroup in this case. It returns all the text of the element without the HTML tags. Then run replace() on the result to replace each run of whitespace characters with a single space.
$('div.trans-container').find('p.wordGroup').each(function(i,elm){
// regex: /\s+/g matches 1 or more whitespace characters \n\r\f\t
var line = $(this).text().replace(/\s+/g," ");
console.log(line);
});
As for doing it with plain JavaScript: you can't in Node.js, since it has no native DOM support, so you have to use a module like cheerio or jsdom. If you mean JavaScript in the browser, it would look like this:
document.querySelectorAll('div.trans-container p.wordGroup')
.forEach(ele=>console.log( ele.innerText.replace(/\s+/g," ") ));