
Web Crawler using Cloud Dataflow

I would like to crawl 3 million web pages a day. Because of the variety of content on the web - HTML, PDF, etc. - I need to use a browser automation tool such as Selenium or Playwright. I noticed that to use Selenium on Google Cloud Dataflow, one has to build a custom container.

  1. Is it a good choice to use Selenium inside ParDo DoFns? Can a single Selenium instance be shared across multiple DoFn instances?
  2. Does the same apply to Playwright? Should I build a custom image for it as well?

You can do anything in a Python DoFn that you can do from Python. Yes, I would definitely use custom containers for complex dependencies like this.
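As an illustration, a custom container for Dataflow typically starts from one of the published Beam Python SDK base images and layers the browser dependencies on top. The sketch below is one plausible Dockerfile, not a tested recipe: the image tag, the Debian `chromium`/`chromium-driver` packages, and the choice of Playwright's bundled Chromium are all assumptions.

```dockerfile
# A minimal sketch, not a tested recipe. Start from a published Beam
# Python SDK base image; the tag should match the pipeline's Beam version.
FROM apache/beam_python3.10_sdk:2.53.0

# One common approach on Debian-based images: install a browser and a
# matching driver for Selenium from the distro packages.
RUN apt-get update \
    && apt-get install -y --no-install-recommends chromium chromium-driver \
    && rm -rf /var/lib/apt/lists/*

# Install the Python automation libraries, then let Playwright fetch its
# own browser build together with the OS-level dependencies it needs.
RUN pip install --no-cache-dir selenium playwright \
    && playwright install --with-deps chromium
```

You would then push the image to a registry your project can read from and point the pipeline at it with the `--sdk_container_image` option (older Beam/Dataflow versions also required `--experiments=use_runner_v2` for custom containers).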

You can share an instance of Selenium (or any other object) per DoFn instance by initializing it in your setup method. You can share one for the whole worker process by using a module-level global or the shared utility (apache_beam.utils.shared), noting that it may then be accessed by more than one thread at once.
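To make the setup-method pattern concrete, here is a minimal sketch of a DoFn that owns one headless Chrome driver per DoFn instance. The class name, the element shape (a bare URL string), and the Chrome flags are illustrative assumptions, not code from the question.

```python
import apache_beam as beam
from selenium import webdriver


class FetchPage(beam.DoFn):
    """Illustrative DoFn: one headless browser per DoFn instance."""

    def setup(self):
        # setup() runs once per DoFn instance, so the driver is reused
        # across every bundle this instance processes.
        options = webdriver.ChromeOptions()
        options.add_argument("--headless=new")  # flag assumes a recent Chrome
        options.add_argument("--no-sandbox")    # commonly needed in containers
        self._driver = webdriver.Chrome(options=options)

    def process(self, url):
        # Render the page and emit its source for downstream parsing.
        self._driver.get(url)
        yield url, self._driver.page_source

    def teardown(self):
        # Free the browser when the runner tears this instance down.
        self._driver.quit()
```

For sharing a single object across the whole worker process instead, `apache_beam.utils.shared.Shared` provides a process-wide handle via `shared_handle.acquire(constructor_fn)`. Since multiple threads may use it concurrently, anything shared that way must be thread-safe; a single Selenium driver generally is not, which is why the per-DoFn-instance pattern above is usually the safer choice here.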
