I've been experimenting with web scraping and wanted to try it with Node.js. I have some experience with web scraping in Python using the requests module and BeautifulSoup4, and I wanted to recreate my code in Node.js. However, after mirroring my code (changing only what was needed to account for the difference in syntax), I cannot find the HTML tag I am looking for. I use JSsoup with Node.js since it is the closest thing I could find to BeautifulSoup. Here is my code so far:
const request = require('request');
var jssoup = require('jssoup').default;

const options = {
    url: 'https://kith.com/collections/footwear/products/nkaj7292-002.xml',
    headers: {
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64)'
    }
};

function getVariant(error, response, body) {
    if (!error && response.statusCode == 200) {
        var soup = new jssoup(body);
        var nametag = soup.find('title');
        var product = nametag.text;
        console.log(product);
        var sizetag = soup.find('title', { string:'9' });
        console.log(sizetag);
    }
}

request(options, getVariant);
The code finds the first tag correctly (<title> Nike Zoom Vomero 5/ACW (Black/Reflect Silver/Anthracite) AT3152-001 </title>) but returns 'undefined' for the second tag. For reference, here is the tag it is trying to find: <title>9</title>
I have also tried using an = instead of a dictionary and using contents and name instead of string but no luck so far. What am I doing wrong here?
I tried looking at the JSsoup documentation too but it does not have much on find().
As one can see in the source, it expects any string to be matched to be provided as the 3rd argument to .find, thus:
let sizetag = soup.find('title', undefined, '9');
I agree with Scott Sauyet that opening an issue may be wise, especially for fixing the documentation.
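If relying on an undocumented positional argument feels brittle, one alternative is to collect every tag with that name and filter by text yourself. This is only a minimal sketch: it assumes findAll(name) returns an array of tags that each expose a .text string (the same property the question reads from nametag). findByText is a hypothetical helper, and stubSoup is a stand-in for a real parsed document so that the snippet runs without a network request.

```javascript
// Hypothetical helper: return the first <name> tag whose trimmed text equals `text`,
// or undefined when nothing matches.
// Assumes soup.findAll(name) returns an array of tags exposing a `.text` string.
function findByText(soup, name, text) {
  return soup.findAll(name).find((tag) => tag.text.trim() === text);
}

// Stand-in for a parsed document (not a real JSsoup instance).
const stubSoup = {
  findAll: (name) =>
    [
      { name: 'title', text: ' Nike Zoom Vomero 5/ACW ' },
      { name: 'title', text: '9' },
      { name: 'id', text: '9' },
    ].filter((tag) => tag.name === name),
};

const sizeTag = findByText(stubSoup, 'title', '9');
console.log(sizeTag ? sizeTag.text : 'not found');
```

With a real document you would pass the soup object from the question in place of stubSoup. The trim() call is there because the tag text may carry surrounding whitespace, as the product title in the question does.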
To get the inner text of a <targetElement> returned by soup.find, use:
<targetElement>.contents[0]._text
I was also trying to scrape HTML with JSsoup in Node.js and found that soup.find returns an object:
SoupTag {
    name: 'time', // name is the tag name
    contents: [ SoupString { // contents is an array
        parent: [Circular *2],
        previousElement: [Circular *2],
        nextElement: [SoupTag],
        _text: '22 hours ago' // here's the inner text
    }],
    attrs: { class: 'post-last-modified-td' },
    hidden: false,
    builder: TreeBuilder {
        EMPTY_ELEMENT_TAGS: Set(24) {...}
    }
}
Here's my code:
const find_time = soup.find("time", "post-last-modified-td");
if (find_time != undefined) console.log("Updated", find_time.contents[0]._text);
It returns:
Updated 22 hours ago
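Since find returns undefined when nothing matches, it can help to wrap the text lookup in a small guard. The following is a rough sketch rather than anything JSsoup provides: tagText is a hypothetical helper, and stubTime is a plain object shaped like the SoupTag dump above, not a real JSsoup tag. It prefers the .text property (which the question reads successfully) and falls back to contents[0]._text only when .text is absent.

```javascript
// Hypothetical helper: read a tag's text, or return undefined
// when find() found nothing or the tag has no text child.
function tagText(tag) {
  if (tag === undefined || tag === null) return undefined;
  // Prefer the public `.text` property when it is present.
  if (typeof tag.text === 'string') return tag.text;
  // Otherwise fall back to the first child's `_text` field.
  const first = Array.isArray(tag.contents) ? tag.contents[0] : undefined;
  return first && typeof first._text === 'string' ? first._text : undefined;
}

// Plain object shaped like the SoupTag dump (an assumption for illustration).
const stubTime = { name: 'time', contents: [{ _text: '22 hours ago' }] };
console.log('Updated', tagText(stubTime));
```

Because _text starts with an underscore, it is presumably an internal field, so going through .text first seems the safer default.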