
Scrape multiple websites using NodeJS, Express, Cheerio and Axios

I would like to scrape multiple websites using NodeJS, Express, Cheerio and Axios. I can currently scrape one website and display the information in the HTML page. But when I try to scrape multiple websites for the same element, it doesn't get through the forEach (it stops after one cycle). Note my loop, which doesn't work correctly: urls.forEach(url => {

The two most important files follow. index.js:

const PORT = 8000
const axios = require('axios')
const cheerio = require('cheerio')
const express = require('express')
const app = express()
const cors = require('cors')
app.use(cors())

const urls = ['https://www.google.nl','https://www.google.de']
// const url = 'https://www.heineken.com/nl/nl/'
app.get('/', function(req, res){
  res.json('Robin')
})

urls.forEach(url => {
  app.get('/results', (req, res) => {
    axios(url)
      .then(response => {
        const html = response.data
        const $ = cheerio.load(html)
        const articles = []

        $('script', html).each(function(){
          const link = $(this).get()[0].namespace
          if (link !== undefined) {
            if (link.indexOf('w3.org') > -1) {
             articles.push({
               link
             })
            }
          }
        })
        res.json(articles)
      }).catch(err => console.log(err))
 })
})

app.listen(PORT, () => console.log(`server running on PORT ${PORT}`))

App.js:

const root = document.querySelector('#root')

fetch('http://localhost:8000/results')
  .then(response => {return response.json()})
  .then(data => {
    console.log(data)
    data.forEach(article => {
      const title = `<h3>` + article.link + `</h3>`
      root.insertAdjacentHTML("beforeend", title)
    })
  })

You're registering multiple route handlers for the same route. Express will only route requests to the first one (a later handler would run only if an earlier one called next()). Move your URL loop inside app.get("/results", ...):

app.get("/results", async (req, res, next) => {
  try {
    res.json(
      (
        await Promise.all(
          urls.map(async (url) => {
            const { data } = await axios(url);
            const $ = cheerio.load(data);
            const articles = [];

            $("script").each(function () {
              const link = $(this).get()[0].namespace;
              if (link !== undefined) {
                if (link.indexOf("w3.org") > -1) {
                  articles.push({
                    link,
                  });
                }
              }
            });
            return articles;
          })
        )
      ).flat() // un-nest each array of articles
    );
  } catch (err) {
    console.error(err);
    next(err); // make sure Express responds with an error
  }
});
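
One thing to keep in mind with Promise.all is that a single failing request rejects the whole batch, so /results would return an error even if the other sites loaded fine. If partial results are acceptable, Promise.allSettled is one possible alternative. The sketch below is only an illustration: scrapeOne is a hypothetical stand-in for the axios + cheerio step above, and the example.com-style URLs are made up so the snippet runs without any network access.

```javascript
// Hypothetical stand-in for the per-URL scraping step (axios + cheerio in the answer).
// It resolves with a list of { link } objects, or rejects to mimic a failed request.
async function scrapeOne(url) {
  if (url.includes("broken")) {
    throw new Error(`request to ${url} failed`);
  }
  return [{ link: `${url}/a` }, { link: `${url}/b` }];
}

// Scrape every URL concurrently and keep the results of the ones that succeed.
async function scrapeAll(urls) {
  const settled = await Promise.allSettled(urls.map((url) => scrapeOne(url)));
  return settled
    .filter((result) => result.status === "fulfilled")
    .flatMap((result) => result.value); // un-nest each array of articles
}

// Usage: the failing URL is skipped instead of failing the whole batch.
scrapeAll(["https://a.example", "https://broken.example"]).then((articles) => {
  console.log(articles);
});
```

Inside the route handler this would become res.json(await scrapeAll(urls)). Whether to skip failed sites silently or report them is a design choice; the rejected entries carry a reason property if you want to log them.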

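To see why the loop in the question appeared to stop after one cycle, here is a toy model of the dispatch rule. It is not Express's actual implementation: createToyRouter and its get/dispatch methods are invented for this illustration, and the handlers record a URL instead of sending a response.

```javascript
// Toy model of a route table: NOT Express itself, just the dispatch rule.
function createToyRouter() {
  const handlers = [];
  return {
    get(path, handler) {
      handlers.push({ path, handler });
    },
    // Call matching handlers in registration order; a later one only runs
    // if the earlier one calls next().
    dispatch(path) {
      const matching = handlers.filter((entry) => entry.path === path);
      const calls = [];
      let index = 0;
      const next = () => {
        const entry = matching[index++];
        if (entry) entry.handler(calls, next);
      };
      next();
      return calls;
    },
  };
}

const router = createToyRouter();
["https://www.google.nl", "https://www.google.de"].forEach((url) => {
  // Same path registered once per URL, as in the question's loop.
  router.get("/results", (calls, next) => {
    calls.push(url); // the handler "responds" and never calls next()
  });
});

const handled = router.dispatch("/results");
console.log(handled); // only the first URL's handler ran
```

The forEach itself does run for every URL; it registers two handlers. The request just never reaches the second one, which is why the loop has to live inside a single handler.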
