Cannot find a tag with JSsoup even though the tag exists in Node JS

I've been experimenting with web scraping and wanted to try it in Node JS. I have some experience scraping in Python with the requests module and BeautifulSoup4, and I wanted to recreate that code in Node JS. However, after mirroring my code (changing only what the difference in syntax requires), I cannot find the HTML tag I am looking for. I use JSsoup with Node JS since it is the closest thing to BeautifulSoup I could find. Here is my code so far:

const request = require('request');
var jssoup = require('jssoup').default;

const options = {
  url: 'https://kith.com/collections/footwear/products/nkaj7292-002.xml',
  headers: {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64)'
  }
};

function getVariant(error, response, body) {
  if (!error && response.statusCode == 200) {
    var soup = new jssoup(body);
    var nametag = soup.find('title');
    var product = nametag.text;
    console.log(product);
    var sizetag = soup.find('title', { string:'9' });
    console.log(sizetag);
  }
}

request(options, getVariant);

The code finds the first tag correctly (<title> Nike Zoom Vomero 5/ACW (Black/Reflect Silver/Anthracite) AT3152-001 </title>) but returns 'undefined' for the second tag. For reference, here is the tag it is trying to find: <title>9</title>

I have also tried using an = instead of a dictionary and using contents and name instead of string but no luck so far. What am I doing wrong here?

I tried looking at the JSsoup documentation too but it does not have much on find().

As one can see in the source, .find expects any string to be matched as its 3rd argument, so an object passed as the 2nd argument is treated as an attribute filter instead. Thus:

let sizetag = soup.find('title', undefined, '9');

I agree with Scott Sauyet that opening an issue may be wise, especially for fixing the documentation.
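If the third-argument form does not match (for example because the tag text carries surrounding whitespace), a manual filter is one fallback. The following is a minimal sketch, not JSSoup's own API: findByText is a hypothetical helper, and it assumes only that the soup object offers findAll(name) and that each tag has a text property, as the tags in the question do. A stand-in object replaces the parsed document so the sketch runs without the network.

```javascript
// Hypothetical helper (not part of JSSoup): exact text match done by hand.
// Assumes `soup.findAll(name)` returns an array of tags with a `text` property.
function findByText(soup, name, text) {
  // Trim first so '<title> 9 </title>' and '<title>9</title>' both match '9'.
  return soup.findAll(name).find((tag) => tag.text.trim() === text);
}

// Stand-in for a parsed document; with the real library you would pass
// the object created by `new jssoup(body)` instead.
const fakeSoup = {
  findAll: (name) => [
    { name, text: ' Nike Zoom Vomero 5/ACW ' },
    { name, text: '9' },
  ],
};

const sizeTag = findByText(fakeSoup, 'title', '9');
console.log(sizeTag ? sizeTag.text : 'no match');
```

With the code from the question, the call would be findByText(soup, 'title', '9'); it returns undefined when nothing matches, just as .find does.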

To get the innerText of <targetElement> with soup.find, use:

<targetElement>.contents[0]._text

I was also trying to scrape HTML with JSsoup in Node JS and found that it returns an object:

SoupTag {
  name: 'time',                           // name is the tag name
  contents: [ SoupString {                // contents is an array
      parent: [Circular *2],
      previousElement: [Circular *2],
      nextElement: [SoupTag],
      _text: '22 hours ago'               // here's the innerText
    }],
  attrs: { class: 'post-last-modified-td' },
  hidden: false,
  builder: TreeBuilder {
    EMPTY_ELEMENT_TAGS: Set(24) {...}
  }
}

Here's my code:

const find_time = soup.find("time", "post-last-modified-td");
if (find_time !== undefined) console.log("Updated", find_time.contents[0]._text);

It returns:

Updated 22 hours ago
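Since _text looks like an internal field, a guarded accessor may be safer than reading contents[0]._text directly. This is only a sketch under assumptions: innerText is a hypothetical helper, and the stand-in tag below copies just the fields visible in the dump above. It prefers the text property the question uses and falls back to the first child's _text.

```javascript
// Hypothetical helper (not part of JSSoup): read a tag's text defensively.
function innerText(tag) {
  if (!tag) return undefined;                         // .find() found nothing
  if (typeof tag.text === 'string') return tag.text;  // property used in the question
  const first = tag.contents && tag.contents[0];
  return first ? first._text : undefined;             // field shown in the dump
}

// Stand-in shaped like the SoupTag dump; the real tag would come from soup.find.
const fakeTag = {
  name: 'time',
  contents: [{ _text: '22 hours ago' }],
  attrs: { class: 'post-last-modified-td' },
};

const updated = innerText(fakeTag);
if (updated !== undefined) console.log('Updated', updated);
```

The helper returns undefined both when the tag is missing and when it has no text child, so a single check covers both cases.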
