
How to wait until web page is loaded before scraping HTML using Puppeteer in headless mode? (C#)

I am trying to scrape a website (www.Vinted.co.uk) that uses JavaScript to load its data. Unfortunately, the JavaScript-loaded data is exactly what I need, so I have to wait for the page to finish loading before I scrape it.

At the moment I am using Puppeteer and have managed to get it working, but only with a visible browser window launching each time. It does not work in headless mode: the page has not finished loading when I read it, even though I pass WaitUntilNavigation.DOMContentLoaded to GoToAsync, so the data is missing from the HTML returned by GetContentAsync.

Here is how my code looks (C#):

public static async Task<string> GetLoadedHTML(string url)
    {
        try
        {
            await new BrowserFetcher().DownloadAsync(BrowserFetcher.DefaultRevision);
            Browser browser = await Puppeteer.LaunchAsync(new LaunchOptions
            {
                Headless = false
            });
            var page = await browser.NewPageAsync();
            page.DefaultTimeout = 0;
            var navigation = new NavigationOptions
            {
                Timeout = 0,
                WaitUntil = new[] {
                WaitUntilNavigation.DOMContentLoaded }
            };
            await page.GoToAsync(url, navigation);
            string content = await page.GetContentAsync();
            await browser.CloseAsync();
            page.Dispose();

            return content;
        }
        catch (Exception ex)
        {
            log.Error(ex);
            throw; // rethrow without resetting the stack trace
        } 
    }

I am open to a different route than Puppeteer if anyone has recommendations. If possible, I'd prefer not to open a visible browser each time, since I'm hoping to run this as a service, where that would be problematic. Getting this working in headless mode should solve my issue, because no browser window would be launched.

Any help appreciated.

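DOMContentLoaded fires as soon as the initial HTML has been parsed, which is before any JavaScript-loaded data arrives. Instead, you can wait for a selector that indicates the page is ready, e.g.: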

await page.WaitForSelectorAsync(".someSelector");
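
Putting this together with the code from the question, a minimal sketch of a headless version might look like the following. It assumes the same PuppeteerSharp version as the question (newer releases changed the BrowserFetcher call), and the names Scraper, GetLoadedHtmlAsync and readySelector are made up for illustration. ".someSelector" is still only a placeholder; it has to be replaced with a selector that matches an element rendered from the JavaScript-loaded data. Waiting for Networkidle2 and the 30-second timeout are optional extras, not something the site is known to require.

```csharp
using System;
using System.Threading.Tasks;
using PuppeteerSharp;

public static class Scraper
{
    // readySelector is a hypothetical parameter: pass a selector that only
    // matches once the JavaScript-loaded data is present in the DOM.
    public static async Task<string> GetLoadedHtmlAsync(string url, string readySelector = ".someSelector")
    {
        // Same download call as in the question; adjust for your PuppeteerSharp version.
        await new BrowserFetcher().DownloadAsync(BrowserFetcher.DefaultRevision);

        // Headless = true, so no browser window is opened.
        var browser = await Puppeteer.LaunchAsync(new LaunchOptions { Headless = true });
        try
        {
            var page = await browser.NewPageAsync();

            // Networkidle2: navigation is considered finished once there have been
            // no more than two network connections for at least 500 ms.
            await page.GoToAsync(url, new NavigationOptions
            {
                WaitUntil = new[] { WaitUntilNavigation.Networkidle2 }
            });

            // Block until the element exists; throws if it does not appear within 30 s.
            await page.WaitForSelectorAsync(readySelector, new WaitForSelectorOptions { Timeout = 30000 });

            return await page.GetContentAsync();
        }
        finally
        {
            // Always close the browser, even if navigation or the wait fails.
            await browser.CloseAsync();
        }
    }
}
```

The explicit timeout is a deliberate change from the question's Timeout = 0: if the selector never appears (for example because the site serves different markup to a headless browser), the call fails with an exception instead of hanging forever, which matters when this runs as a service.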
