
Web Crawler using Cloud Dataflow

I would like to crawl 3 million web pages a day. Because the pages vary in nature (HTML, PDF, etc.), I need to use tools like Selenium or Playwright. I noticed that to use Selenium on Google Dataflow, one has to build a custom container.

  1. Is it a good choice to use Selenium inside ParDo functions (DoFns)? Can we share a single Selenium instance across multiple instances?
  2. Does the same apply to Playwright? Should I build a custom image for it as well?

You can do anything in a Python DoFn that you can do from Python. And yes, I would definitely use custom containers for complex dependencies like this.
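As a sketch of what such a custom worker container might look like, here is a hypothetical Dockerfile based on Beam's published Python SDK base images (the exact tag, package names, and dependency list are assumptions; match them to your SDK version and distro):

```dockerfile
# Sketch only: pick the base image tag that matches your pipeline's Beam SDK version.
FROM apache/beam_python3.10_sdk:2.53.0

# System packages a headless browser typically needs (package names vary by distro release).
RUN apt-get update && apt-get install -y --no-install-recommends \
        chromium chromium-driver \
    && rm -rf /var/lib/apt/lists/*

# Python dependencies for the crawl DoFns.
RUN pip install --no-cache-dir selenium playwright \
    && playwright install --with-deps chromium
```

You would then point the Dataflow job at this image with the `--sdk_container_image` pipeline option when launching it.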

You can share an instance of Selenium (or any other object) per DoFn instance by initializing it in the DoFn's setup method. You can share it across the whole worker process by using a module-level global or something like Beam's shared utility (noting that it may then be accessed by more than one thread at once).
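The two sharing scopes described above can be sketched as follows. This is a minimal illustration using only the standard library: `FakeDriver` is a placeholder for a real Selenium WebDriver, and `CrawlFn` mimics Beam's DoFn lifecycle (in a real pipeline it would subclass `apache_beam.DoFn`, and Beam itself would call `setup`, `process`, and `teardown`):

```python
import threading

# Process-wide sharing: one expensive client (e.g. a WebDriver) reused by
# every DoFn instance in this worker process. Because several threads may
# request it at once, creation is guarded by a lock.
_driver = None
_driver_lock = threading.Lock()

def get_shared_driver(factory):
    """Return the process-wide client, creating it on first use."""
    global _driver
    with _driver_lock:
        if _driver is None:
            _driver = factory()
        return _driver

class CrawlFn:  # in a real pipeline: class CrawlFn(beam.DoFn)
    """Per-instance sharing: one driver per DoFn instance, not per element."""

    def __init__(self, factory):
        self._factory = factory
        self._driver = None

    def setup(self):
        # Beam calls setup() once per DoFn instance; the driver created here
        # is reused across all elements this instance processes.
        self._driver = self._factory()

    def process(self, url):
        yield (url, self._driver.fetch(url))

    def teardown(self):
        self._driver.quit()

class FakeDriver:
    """Placeholder for a real WebDriver, for illustration only."""
    def fetch(self, url):
        return f"<html>{url}</html>"
    def quit(self):
        pass

# Usage: Beam would drive this lifecycle itself.
fn = CrawlFn(FakeDriver)
fn.setup()
print(next(fn.process("https://example.com")))
fn.teardown()
```

The per-instance variant is usually the safer default, since most WebDriver objects are not thread-safe; reach for the process-wide singleton only for objects that are expensive to build and safe to share.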

