Web Crawler using Cloud Dataflow
I would like to crawl 3 million web pages in a day. Because the pages vary in nature - HTML, PDF, etc. - I need to use tools like Selenium or Playwright. I noticed that to use Selenium on Google Dataflow, one has to build a custom container.
You can do anything in a Python DoFn that you can do from Python. Yes, I would definitely use custom containers for complex dependencies like this.
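As a minimal sketch of what that looks like from the pipeline side, the custom SDK container image can be passed through the pipeline options (the project, bucket, and image URI below are placeholders, not values from the question):

```python
# Sketch: pointing a Dataflow pipeline at a custom SDK container image
# that bundles Selenium/Playwright plus a headless browser.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(
    runner="DataflowRunner",
    project="my-project",                      # placeholder
    region="us-central1",                      # placeholder
    temp_location="gs://my-bucket/temp",       # placeholder
    experiments=["use_runner_v2"],             # custom containers require Runner v2
    sdk_container_image="gcr.io/my-project/beam-selenium:latest",  # placeholder image
)

with beam.Pipeline(options=options) as p:
    (p
     | beam.Create(["https://example.com"])    # placeholder URL list
     | beam.Map(print))
```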
You can share instances of Selenium (or any other object) per DoFn instance by initializing them in your setup method. You can share one for the whole process by using a module-level global or something like Beam's shared module (noting that it may then be accessed by more than one thread at once).
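Here is a small sketch of the per-DoFn-instance approach, assuming a headless Chrome driver is available inside the custom container (the Chrome options and output format are assumptions, not part of the original answer):

```python
# Sketch: one Selenium WebDriver per DoFn instance, created in setup()
# and released in teardown(), so the browser is reused across bundles
# instead of being relaunched for every element.
import apache_beam as beam
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

class FetchPage(beam.DoFn):
    def setup(self):
        # Assumed setup: adjust browser/driver configuration to match
        # whatever is installed in your custom container image.
        opts = Options()
        opts.add_argument("--headless=new")
        opts.add_argument("--no-sandbox")
        self.driver = webdriver.Chrome(options=opts)

    def process(self, url):
        self.driver.get(url)
        yield url, self.driver.page_source

    def teardown(self):
        # Quit the browser when the DoFn instance is torn down.
        self.driver.quit()
```

For a process-wide singleton instead, the driver could be held in a module-level global or wrapped with `apache_beam.utils.shared.Shared`, but it must then be safe to use from multiple threads at once.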