Scrape a JavaScript-generated website in C# without installing a browser

I am developing a website crawler API to scrape a JavaScript-generated website. The site requires JavaScript to be enabled to fully render its HTML. I have tried libraries such as HtmlAgilityPack and AngleSharp, but they are only HTML parsers and cannot render the page because they lack JavaScript support.

I tried a headless browser using Selenium.WebDriver.ChromeDriver, and it worked very well on my local machine. However, our production environment is very restricted: only Internet Explorer is available, and we are not allowed to install any other browser, so ChromeDriver does not work there. Internet Explorer itself cannot fully render the website, so IE is definitely out.

Is there a way to scrape a JavaScript-generated website without installing a browser, for example by running a headless browser on a server that does not have that browser installed? Or is this a dead end? Thanks!

You can try a solution that bundles a fully functional Chromium and doesn't require Google Chrome to be installed in the target environment. All the required Chromium binaries ship with the solution.

There are several such solutions for .NET and C#:

CefSharp

An open source .NET wrapper around the Chromium Embedded Framework (CEF). It allows you to embed Chromium in .NET apps.

Supported by the community. If you need help using the library, read the docs or ask the community. If you need a feature or a bug fix, you will probably have to implement it yourself.
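As a rough illustration of the CefSharp route, here is a minimal sketch of an off-screen (headless) console scraper. It assumes the CefSharp.OffScreen NuGet package in a reasonably recent release (one that provides WaitForInitialLoadAsync) and a project that targets x86 or x64. The URL is only a placeholder, and the class and method names are made up for this example.

```csharp
using System;
using System.Threading.Tasks;
using CefSharp;
using CefSharp.OffScreen;

internal static class Program
{
    private static int Main(string[] args)
    {
        // Placeholder URL; pass the real target as the first argument.
        string url = args.Length > 0 ? args[0] : "https://example.com/";

        // Initialize and shut down CEF on the same thread (the main thread here).
        Cef.Initialize(new CefSettings());
        try
        {
            // The off-screen browser renders without a window or an installed Chrome.
            using (var browser = new ChromiumWebBrowser(url))
            {
                string html = GetRenderedHtmlAsync(browser).GetAwaiter().GetResult();
                Console.WriteLine(html);
            }
            return 0;
        }
        catch (Exception ex)
        {
            Console.Error.WriteLine(ex.Message);
            return 1;
        }
        finally
        {
            Cef.Shutdown();
        }
    }

    // Hypothetical helper: waits for the initial page load, then reads the live DOM.
    private static async Task<string> GetRenderedHtmlAsync(ChromiumWebBrowser browser)
    {
        LoadUrlAsyncResponse load = await browser.WaitForInitialLoadAsync();
        if (!load.Success)
        {
            throw new InvalidOperationException($"Navigation failed: {load.ErrorCode}");
        }

        // outerHTML reflects the DOM after scripts have run, not the raw HTTP response.
        JavascriptResponse response =
            await browser.EvaluateScriptAsync("document.documentElement.outerHTML");
        if (!response.Success)
        {
            throw new InvalidOperationException(response.Message);
        }

        return Convert.ToString(response.Result);
    }
}
```

Pages that keep fetching data after the load event may need an extra wait (for example, polling for a known element) before the DOM is read. The resulting HTML string can then be handed to a parser such as HtmlAgilityPack or AngleSharp.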

DotNetBrowser

A commercial library that allows integrating a Chromium-based browser with your .NET app to display and process HTML5, CSS3, JavaScript, etc.

It's a proprietary solution supported by a commercial company. If you need help using the library, read the docs or get help from the product's engineers. If you need a feature or a bug fix, the product team will handle it as soon as possible. I know this because I know the engineers on the DotNetBrowser team.

WebView2

This control allows you to embed web technologies (HTML, CSS, and JavaScript) in your native apps. The WebView2 control uses Microsoft Edge (Chromium) as the rendering engine to display web content in native apps. With WebView2, you can embed web code in different parts of your native app, or build the whole native app within a single WebView instance. Supported by Microsoft.

If you need help, contact the WebView2 team.
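For comparison, a WebView2-based scraper might look roughly like the sketch below. It assumes a WinForms project with the Microsoft.Web.WebView2 NuGet package, and it assumes that a WebView2 runtime is available, either installed on the machine or shipped alongside the app as a Fixed Version runtime. The WEBVIEW2_FIXED_RUNTIME environment variable is a made-up name used only to pass that folder in; the URL is a placeholder.

```csharp
using System;
using System.Text.Json;
using System.Windows.Forms;
using Microsoft.Web.WebView2.Core;
using Microsoft.Web.WebView2.WinForms;

internal static class Program
{
    [STAThread]
    private static void Main()
    {
        // WebView2 needs a UI thread and a host window, even if nobody looks at it.
        var form = new Form { Width = 1024, Height = 768 };
        var webView = new WebView2 { Dock = DockStyle.Fill };
        form.Controls.Add(webView);

        form.Load += async (sender, e) =>
        {
            // Hypothetical variable: folder of a Fixed Version runtime shipped with the app.
            // When it is null, the runtime installed on the machine is used instead.
            string runtimeFolder = Environment.GetEnvironmentVariable("WEBVIEW2_FIXED_RUNTIME");
            CoreWebView2Environment env = await CoreWebView2Environment.CreateAsync(runtimeFolder);
            await webView.EnsureCoreWebView2Async(env);

            webView.CoreWebView2.NavigationCompleted += async (s, args) =>
            {
                try
                {
                    if (args.IsSuccess)
                    {
                        // ExecuteScriptAsync returns the result JSON-encoded, so decode it.
                        string json = await webView.ExecuteScriptAsync(
                            "document.documentElement.outerHTML");
                        string html = JsonSerializer.Deserialize<string>(json);
                        Console.WriteLine(html);
                    }
                    else
                    {
                        Console.Error.WriteLine($"Navigation failed: {args.WebErrorStatus}");
                    }
                }
                finally
                {
                    form.Close(); // ends Application.Run below
                }
            };

            webView.CoreWebView2.Navigate("https://example.com/"); // placeholder URL
        };

        Application.Run(form);
    }
}
```

Because this approach depends on a message loop and a window, it may be a weaker fit for a server-side API than an off-screen browser, so treat it as an option to evaluate rather than a drop-in answer.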

