
How to retrieve the DOM object from an external URL in Node.js

I need to build a database of random text from different websites to test my module, which works on sentences. The more sentences I have, the better, because I can predict and cover more cases in my algorithms. I began doing this manually, but it took me 8 hours to retrieve only 500 pages of text, which is not very efficient.

I'm wondering whether it is possible to load a website into some npm module so that I get the website's DOM object and can then use JS to retrieve text from, e.g., <p>, <h1>-<h6>, and <li> elements. In web browsers, the F12 DevTools window gives access to the DOM. Is it possible to get similar access to the DOM with an npm package running on the desktop?

What I know is that it is not possible to access the DOM of an external website loaded into an iframe. What about using Node.js from the desktop?

If I understood your question properly, this is web scraping: fetching a page and reading the DOM elements beneath it. If that is the case, you can use npm modules built for web scraping. Two well-known ones are:

1. Cheerio: It's a server-side version of jQuery, so if you're familiar with jQuery it is hassle-free to work with; it is also lightweight and flexible. After you fetch the remote content (with any HTTP request), you parse it and select elements the same way you would in jQuery. One drawback: it does not execute the page's scripts, so it falls short when content is added dynamically after the page loads.


2. JSDom: This is the closest thing to a headless browser. It is a pure-JavaScript implementation of the DOM and HTML standards, so it gives a close representation of the page's DOM, and because it can optionally run the page's scripts it can also return content that is updated dynamically.
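
To make the two options concrete, here is a minimal sketch of pulling text out of <p>, <h1>-<h6> and <li> elements with each module. It assumes `npm install cheerio jsdom` and Node 18+ (for the built-in `fetch`); the names `SELECTOR`, `cleanText`, `extractWithCheerio`, `extractWithJsdom` and `scrape` are hypothetical helpers made up for this illustration, not part of either library.

```javascript
// Sketch only: every function name and the SELECTOR constant are hypothetical
// helpers for this example, not part of cheerio or jsdom.

// Elements the question wants text from.
const SELECTOR = 'p, h1, h2, h3, h4, h5, h6, li';

// Collapse runs of whitespace so each element yields one clean line of text.
function cleanText(raw) {
  return raw.replace(/\s+/g, ' ').trim();
}

// Option 1: cheerio parses static HTML and offers jQuery-style selection.
function extractWithCheerio(html) {
  // Required lazily so each option can be tried with only its own module installed.
  const cheerio = require('cheerio');
  const $ = cheerio.load(html);
  return $(SELECTOR)
    .map((_, el) => cleanText($(el).text()))
    .get()
    .filter(Boolean); // drop elements that contained no text
}

// Option 2: jsdom builds a DOM that supports the standard querySelectorAll API.
function extractWithJsdom(html) {
  const { JSDOM } = require('jsdom');
  const { document } = new JSDOM(html).window;
  return Array.from(document.querySelectorAll(SELECTOR), (el) => cleanText(el.textContent))
    .filter(Boolean);
}

// Download a page and return the text of its matching elements.
async function scrape(url, extract = extractWithCheerio) {
  const response = await fetch(url);
  if (!response.ok) {
    throw new Error(`Request failed: ${response.status} ${url}`);
  }
  return extract(await response.text());
}
```

Something like `scrape('https://example.com').then(console.log)` would print the collected strings, and passing `extractWithJsdom` as the second argument switches parsers. Neither function above runs the page's scripts; for pages that build their content with JavaScript, jsdom's `runScripts` option is the place to look, with the usual caution about executing untrusted code.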
