
Puppeteer Different Behavior with Browser

I am getting different results from querySelector commands in the browser console and the equivalent Puppeteer code in Node.js. I'm using headless Chromium with puppeteer-extra-plugin-stealth, so it probably isn't a captcha or anything like that. Instead, some kind of ad redirect seems to be messing everything up.

Using querySelector on the console, I can click on the button I want.

document.querySelectorAll("button[class*=OfferCta__]")[1].click()

This opens a modal on the same page. Sometimes it also opens a new tab with another website.

Unfortunately, when I run the equivalent on puppeteer via nodejs:

    const buttons = await page.$$("button[class*=OfferCta__]");
    await page.waitFor(2000);
    await buttons[1].click();
    await page.waitFor(6000);
    await page.screenshot({ path: "screenshot.png" });

All that shows up is an ad followed by lots of white space and nothing else.

The page I'm trying to scrape is https://www.retailmenot.com/view/myntra.com

Edit: I discovered something interesting in the logs. Every two minutes, puppeteer seems to restart. It fails as usual the first few times, but on the fifth restart, ten minutes after I started the code, it runs correctly (the run I marked with "done retrieving"). I have a feeling this means that the success condition, whatever it is, is random?

Edit: It does seem to be random in timing. I opened a headful instance (headless: false) so I could watch what was happening. Puppeteer gets to varying stages of the correct result (the landing page, the page before it clicks, once it even managed to click successfully) before the lone ad above whitespace takes over.

The strange thing is, the url still shows the correct address, so it isn't a redirect either, even though it shows an ad and nothing else. Very strange.

I have the sinking feeling this may be an anti-bot feature. Detect a bot? Switch to ad and no content. If this is the case, perhaps I can play with time delays and see what works.

I have divined that it is a javascript script that is messing things up. I disabled javascript in puppeteer and it stopped giving me the ad. Unfortunately, that also breaks the button-press functionality I needed originally, so...I will see if I can find the offending script.
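For anyone repeating this experiment, here is a minimal sketch of turning JavaScript off before navigating. The helper name `gotoWithoutJavaScript` is made up for illustration, and `page` is assumed to be a Puppeteer page that was opened elsewhere; only `page.setJavaScriptEnabled` and `page.goto` are Puppeteer calls.

```javascript
// Hypothetical helper: load a URL with the page's scripts disabled.
// `page` is assumed to be a Puppeteer Page created elsewhere.
async function gotoWithoutJavaScript(page, url) {
  // Set this before navigating; scripts that have already run are not undone.
  await page.setJavaScriptEnabled(false);
  await page.goto(url);
}
```

Calling `await gotoWithoutJavaScript(page, "https://www.retailmenot.com/view/myntra.com")` should load the static HTML only, which is why the button handler stops working as well.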

I have found the offending scripts:

https://www.retailmenot.com/tng/_next/static/chunks/34.c2f99cfb33704560c5d7.js and https://www.retailmenot.com/tng/_next/static/chunks/35.f67d49e4abce303212c6.js

Blocking requests for them in developer tools stopped the ads. As you can see, both are served from the same site, so it WAS intentional. Bastards... grumble grumble.
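The same kind of hunt can probably be done from Node instead of developer tools by recording every script the page asks for. This is only a sketch: `collectScriptUrls` is a hypothetical helper name, and `page` is assumed to be a Puppeteer page on which the listener is registered before `page.goto()`.

```javascript
// Hypothetical helper: record the URL of every script the page requests.
// Register it before navigating so early requests are not missed.
function collectScriptUrls(page) {
  const urls = [];
  page.on("request", (request) => {
    // resourceType() reports "script" for JavaScript files.
    if (request.resourceType() === "script") urls.push(request.url());
  });
  return urls; // the array fills up as requests are observed
}
```

Printing the returned array after the page has loaded gives a list of candidates to block one at a time.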

Now, how to block these programmatically...

I noticed something strange earlier, though. The scripts only ran when I moved my mouse a bit. That shouldn't happen when puppeteer is running, right? Is there an option I can set to make puppeteer's mouse activity undetectable?

It appears the script runs on start as well as when the mouse moves, so that explains the above. I have written a request blocker and got rid of a whole swathe of third-party scripts involving ads. Let's see if it works.

Edit: worked.
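A request blocker limited to third-party scripts might look roughly like the sketch below. It is not the code actually used here: `isThirdParty` and `blockThirdPartyScripts` are hypothetical names, and the host comparison is a simple assumption (same host or a subdomain counts as first-party). Note that the two chunk scripts named above are same-site, so a host-based check like this would not catch them; those need a URL filter.

```javascript
// True when `url` is served from a host other than `firstPartyHost`
// or one of its subdomains.
function isThirdParty(url, firstPartyHost) {
  const host = new URL(url).hostname;
  return host !== firstPartyHost && !host.endsWith("." + firstPartyHost);
}

// Hypothetical helper: abort third-party script requests, let the rest through.
// `page` is assumed to be a Puppeteer Page created elsewhere.
async function blockThirdPartyScripts(page, firstPartyHost) {
  // Interception must be enabled before abort()/continue() can be called.
  await page.setRequestInterception(true);
  page.on("request", (request) => {
    const blocked =
      request.resourceType() === "script" &&
      isThirdParty(request.url(), firstPartyHost);
    if (blocked) request.abort();
    else request.continue();
  });
}
```

It would be called as `await blockThirdPartyScripts(page, "retailmenot.com")` before `page.goto()`.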

A script on the web page itself turned out to be what made it behave differently. I ended up blocking scripts with this code, which in fact aborts every request the page makes, not just third-party ones:

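    // Interception must be enabled first, or request.abort() throws.
    await page.setRequestInterception(true);
    page.on("request", (request) => {
      // Aborts every request, including a page.goto() navigation made after this point.
      request.abort();
    });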

Since the script that operated the button I needed was embedded in the html, this was fine. If you need to be more fine-grained in your request blocking, you can do this:

    // Interception must be enabled before request.abort()/continue() can be used.
    await page.setRequestInterception(true);

    // Any request whose URL contains one of these substrings is blocked.
    const filters = [
      "https://www.retailmenot.com/tng/_next/static/chunks/",
      "https://www.retailmenot.com/thumbs/ops/promoContent/Site_SavingsEducation_StickyBanner_200x100.png",
      "btstatic",
      "googleadservices",
      "doubleclick",
      "idsync",
      "quant",
      "facebook",
      "amazon",
      "tracking",
      "taboola",
      ".gif",
      "google-analytics",
      "forter",
    ];
    // The "commons." chunk is let through even though it matches the first filter.
    const allowed = "https://www.retailmenot.com/tng/_next/static/chunks/commons.";

    page.on("request", (request) => {
      const url = request.url();
      // Example of a blocked chunk:
      // https://www.retailmenot.com/tng/_next/static/chunks/34.c2f99cfb33704560c5d7.js
      const shouldAbort =
        !url.includes(allowed) && filters.some((urlPart) => url.includes(urlPart));
      if (shouldAbort) request.abort();
      else request.continue();
    });
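Because the filter is plain substring matching, it can be checked against sample URLs without launching a browser. The sketch below pulls the decision into a standalone function; the name `shouldAbort` as a function, the shortened `FILTERS` list, and the `commons.abc123.js` file name are illustrative assumptions rather than part of the code above.

```javascript
const CHUNKS = "https://www.retailmenot.com/tng/_next/static/chunks/";

// Abbreviated subset of the filter list, for illustration only.
const FILTERS = [CHUNKS, "btstatic", "googleadservices", "doubleclick", "google-analytics", ".gif"];

// Returns true when a request for `url` should be aborted.
function shouldAbort(url) {
  // The "commons." chunk is always allowed through.
  if (url.includes(CHUNKS + "commons.")) return false;
  return FILTERS.some((part) => url.includes(part));
}

console.log(shouldAbort(CHUNKS + "34.c2f99cfb33704560c5d7.js")); // true
console.log(shouldAbort(CHUNKS + "commons.abc123.js")); // false (hypothetical file name)
console.log(shouldAbort("https://www.retailmenot.com/view/myntra.com")); // false
```

One thing to watch with this approach: short substrings such as "quant" or "amazon" match anywhere in a URL, so it is worth running a few known-good URLs through the function before trusting the list.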


 