I would like Crawl 3 million web pages in a day. Due to variety of web nature - HTML, pdf etc. I need to use Selenium, Playwright etc. I noticed to use Selenium one has to build a custom container using Google DataFlow
You can do anything in a Python DoFn that you can do from Python. Yes, I would definitely use custom containers for complex dependencies like this.
You can share instances of Selenium (or any other object) per DoFn instance by initializing it in your setup method. You can share it for the whole process by using a module-level global or something like shared (noting that it may be accessed by more than one thread at once).
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.