
How do I get the selector of an element from a web page that contains more than one HTML document?

I'm trying to get information from a web page using Puppeteer, but I can't find the selector I need. I suppose that's because the page contains more than one HTML document, and I can't find a way to get the data I need.

The inspection of the page

This is the code:

const puppeteer = require('puppeteer');

(async ()=>{
    const browser = await puppeteer.launch({headless:false});

    const page = await browser.newPage();

    await page.goto('https://www.arrivia.com/careers/job-openings/');

    await page.waitForSelector('.job-search-result');

    const data = await page.evaluate(()=>{
        const elements = document.querySelectorAll('.job-search-result .job-btn-container a');
            
        // Declare the variables locally instead of creating implicit globals
        const vacancies = [];
        
        for (const element of elements) {
            vacancies.push(element.href);
        }

        return vacancies;
    });

    console.log(data.length);

    const vacancies = [];
    
    for (let i = 0; i <= 2; i++) {
        const urljob = data[i];
        await page.goto(urljob);
        // This is one of the selectors that I can't find; from here I get an error
        await page.waitForSelector(".app-title");
        const jobs = await page.evaluate((urljob) => {
            const job = {};
            job.title = document.querySelector(".app-title").innerText;
            job.location = document.querySelector(".location").innerText;
            job.url = urljob;
            return job;
        }, urljob); // pass urljob into the page context explicitly

        vacancies.push(jobs);
    }

    console.log(vacancies);
    //await page.screenshot({ path: 'xx1.jpg'});

    await browser.close()

})();

Iframes are not always the easiest things to deal with in Puppeteer. One way to bypass this is to go directly to the URL of the iframe instead of the page that hosts it. It's also faster:

const puppeteer = require("puppeteer");

(async () => {
  const browser = await puppeteer.launch({ headless: false, defaultViewport: null });

  const page = await browser.newPage();
  await page.goto("https://www.arrivia.com/careers/job-openings/", {
    waitUntil: "domcontentloaded",
  });

  const jobUrls = await page.$$eval(".job-search-result .job-btn-container a",
                                    els => els.map(el => el.href));

  const vacancies = [];

  for (let i = 0; i < Math.min(10, jobUrls.length); i++) { // capped at 10 while testing; use jobUrls.length to scrape every job
    const url = jobUrls[i];
    const jobId = /job_id=(\d+)/.exec(url)[1]; // Extract the ID from the link
    await page.goto(
      `https://boards.greenhouse.io/embed/job_app?token=${jobId}`, // Go to iframe URL
      { waitUntil: "domcontentloaded" }
    );
    vacancies.push({
      title: await page.$eval(".app-title", el => el.innerText),
      location: await page.$eval(".location", el => el.innerText),
      url,
    });
  }

  console.log(vacancies);

  await browser.close();
})();

Output:

[
  {
    title: 'Director of Account Management',
    location: 'Scottsdale, AZ',
    url: 'https://www.arrivia.com/careers/job/?job_id=2529695'
  },
  {
    title: "Site Admin and Director's Assistant",
    location: 'Albufeira, Portugal',
    url: 'https://www.arrivia.com/careers/job/?job_id=2540303'
  },
  ...
]
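
The ID extraction in the loop above throws if a link has no job_id, because the regex match is then null. A minimal sketch of a more defensive version follows. It assumes the embed URL pattern used in the answer still applies, and toEmbedUrl is a hypothetical helper name, not part of Puppeteer or Greenhouse:

```javascript
// Hypothetical helper (not part of Puppeteer): builds the iframe URL for a job link.
// The embed URL pattern is taken from the answer above and is assumed, not verified.
function toEmbedUrl(jobUrl) {
  // Read job_id from the query string instead of matching it with a regex
  const jobId = new URL(jobUrl).searchParams.get("job_id");
  // Return null rather than throwing when the link has no numeric job_id
  if (jobId === null || !/^\d+$/.test(jobId)) {
    return null;
  }
  return `https://boards.greenhouse.io/embed/job_app?token=${jobId}`;
}

const jobUrls = [
  "https://www.arrivia.com/careers/job/?job_id=2529695",
  "https://www.arrivia.com/careers/job-openings/", // no job_id, gets skipped
];

// Keep only the links that could be converted
const embedUrls = jobUrls.map(toEmbedUrl).filter((url) => url !== null);
console.log(embedUrls);
```

In the scraping loop, the result of toEmbedUrl(url) could be passed to page.goto, skipping the iteration when it is null.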
