
Crawling a site's inline script data with cheerio

I am trying to crawl a system whose pages are structured like this:

<body>
<script type="text/javascript" >
        </script>
    <script type="text/javascript" >
    COMPANY_DATA = {} //object js
        </script>
<script type="text/javascript" >
        </script>
</body>

Other sites are structured like this:

<body>
    <script type="text/javascript" >
    COMPANY_DATA = {} //object js
        </script>
<script type="text/javascript" >
        </script>

</body>

My function:

const crawling = async(url) =>{
      
    axios
      .get(url)
      .then((response) => {
           const html = response.data;
    var str,
      $ = cheerio.load(html, { xmlMode: true });

    str = $("script:not([src])")[0].children[0].data;
    const regex = /(?<=COMPANY_POSITIONS_DATA = ).*/gim;

    const data = str.match(regex);
    const dataEval = eval(data[0]);
          console.log(dataEval);
        });
      }

It works on some of the sites but not on others, because they don't all share the same structure. How can I run this against all of the scripts on a page?

First and foremost:

Do not use eval on arbitrary data from the web!

The chance is low, but if the payload is require("child_process").exec("rm <ADDED FOR SAFETY> -rf /") then you'll wipe your disk. And that's just one example of a malicious script. Others may be much more subtle. Why risk it when there are better options?

JSON.parse() is safer and works on the two sample pages you shared, which contain object structures that are also valid JSON.
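
To make the difference concrete, here is a minimal offline sketch of the parse step, with no network access and no cheerio. The sample script text and the helper name parseAssignment are assumptions for illustration, not taken from the real pages. It is meant to show that JSON.parse() either returns data or throws, and never executes the payload.

```javascript
// Hypothetical sample of an inline script's text; real pages embed far more data.
const scriptText = `
  var OTHER = 1;
  COMPANY_POSITIONS_DATA = [{"name": "Engineer", "uid": "A1"}];
`;

// Capture everything after "COMPANY_POSITIONS_DATA =" on that line,
// strip trailing semicolons, and parse the rest as JSON.
const parseAssignment = (text) => {
  const match = text.match(/^ *COMPANY_POSITIONS_DATA *= *(.*)$/m);
  if (!match) return null; // this script does not define the variable
  try {
    return JSON.parse(match[1].replace(/;+$/, ""));
  } catch {
    return null; // not valid JSON; the text is rejected, never executed
  }
};

console.log(parseAssignment(scriptText));
console.log(parseAssignment('COMPANY_POSITIONS_DATA = require("child_process")'));
```

The try/catch is a design choice of this sketch: a script whose value is not valid JSON is skipped instead of aborting the whole crawl.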

// run with: node --insecure-http-parser

const axios = require("axios"); // ^0.21.4
const cheerio = require("cheerio"); // 1.0.0-rc.12

// optionally don't truncate printed objects
// https://stackoverflow.com/a/41882441/6243352
require("util").inspect.defaultOptions.depth = null;

const extractCompanyPositions = html => {
  const $ = cheerio.load(html);
  return [...$("script:not([src])")].map(e => {
    const reg = /^ *COMPANY_POSITIONS_DATA *= *(.*)$/m;
    const match = $(e).text().match(reg);
    return match && JSON.parse(match[1].replace(/;+$/, ""));
  }).filter(Boolean)
};

const urls = [
  "https://www.comeet.com/jobs/accessibe/D5.00B",
  "https://www.comeet.com/jobs/razorlabs/A5.002",
  // add more URLs here
];

Promise.all(urls.map(url => 
  axios.get(url).then(({data}) => extractCompanyPositions(data))
)).then(results => {

  // flatten if you want to merge all scripts from all sites into one array
  console.log(results.flat(2));
});

const axios = require("axios");
const cheerio = require("cheerio");
require("util").inspect.defaultOptions.depth = null;

const extract = async (url) => {
  // await the request so the function resolves with the parsed data
  const { data: html } = await axios.get(url);
  const $ = cheerio.load(html);
  // cheerio's .map() passes (index, element), so spread into a plain array first
  return [...$("script:not([src])")]
    .map((e) => {
      const match = $(e).text().match(/^ *COMPANY_POSITIONS_DATA *= *(.*)$/m);
      return match && JSON.parse(match[1].replace(/;+$/, ""));
    })
    .filter(Boolean);
};

const urls = [
  "https://www.comeet.com/jobs/accessibe/D5.00B",
  "https://www.comeet.com/jobs/razorlabs/A5.002",
];

const start = async () => {
  const crawlResults = await Promise.all(urls.map(extract));
  return crawlResults;
};

start().catch(console.error);

I tried something like this, exporting the function to a scheduler:

const startComeet=Promise.all(urls.map(url => 
  axios.get(url).then(({data}) => extractCompanyPositions(data))
)).then(
  results => {
  for (let i = 0; i < results.length; i++) {  
  
   results[i][0].some(async function (job) {
    let title = job?.name;
    let location = job?.location?.name;
    let locationCountry = job?.location?.country;
    let locationCity = job?.location?.city;
    let companyName = job?.company_name;
    let idJob = `${companyName}-${job.uid}`;
    let link = job?.url_comeet_hosted_page;

      await saveData(title, link, location, idJob,companyName)
    
  });
}
});
module.exports = startComeet;

But I get stuck when calling it: TypeError: startComeet is not a function. How can I export it and call it like a function, and make it look nicer? Thanks a lot!
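
The TypeError is consistent with the code above: Promise.all(...).then(...) runs immediately and returns a promise, so startComeet holds a promise, not a function. One possible fix, sketched below, is to wrap the work in an async function and export that. The fetchPositions and saveData bodies here are hypothetical stubs so the sketch runs offline; in the real module they would be the axios and cheerio extraction from the answer and your own saveData.

```javascript
// Hypothetical stand-in for: axios.get(url).then(({data}) => extractCompanyPositions(data))
const fetchPositions = async (url) => [
  [{ name: "Engineer", uid: "A1", company_name: "Acme",
     url_comeet_hosted_page: url, location: { name: "HQ" } }],
];

// Hypothetical stand-in for the real saveData; it just records its arguments.
const saved = [];
const saveData = async (title, link, location, idJob, companyName) => {
  saved.push({ title, link, location, idJob, companyName });
};

// The exported value is a function, so the scheduler decides when the crawl runs.
const startComeet = async (urls) => {
  const results = await Promise.all(urls.map(fetchPositions));
  for (const scripts of results) {
    // for...of awaits every save. Array.prototype.some with an async callback
    // stops after the first job, because the returned promise is always truthy.
    for (const job of scripts[0] ?? []) {
      const companyName = job?.company_name;
      await saveData(
        job?.name,
        job?.url_comeet_hosted_page,
        job?.location?.name,
        `${companyName}-${job?.uid}`,
        companyName
      );
    }
  }
  return saved.length; // number of jobs saved so far
};

module.exports = startComeet;
```

In the scheduler file, the function would then be imported with require (the path depends on your project layout) and invoked as startComeet(urls), with a .catch() or try/await around the call to handle failures.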


 