
Is datastore mapreduce deprecated?

I have just installed Google Cloud Platform for a free trial. In order to run MapReduce tasks with Datastore, the docs say to run

./bdutil --upload_files "samples/*" run_command ./test-mr-datastore.sh

But I couldn't find this file locally, and there's a good reason for that: this way of running MapReduce jobs seems to be deprecated (see this issue on GitHub). Is that true? Is there an alternative way to create MapReduce tasks from the local command line without requiring BigQuery?

NOTE: The Google team removed the Datastore connector from bdutil in v1.3.0 (2015-05-27) and onward, so you might need to use an older version, or use GCS or BigQuery as a proxy to access your data in Datastore.

I'll try to cover as much as I can, but bdutil involves a lot more detail than is practical to document in this answer. I hope this is enough to get you started:

  • Set up the Google Cloud SDK - details

     # Download SDK
     curl https://sdk.cloud.google.com | bash
     # Restart your shell
     exec -l $SHELL
     # Authenticate to GCP
     gcloud auth login
     # Select project
     gcloud config set project PROJECT_NAME
  • Download and extract the bdutil source code that contains the Datastore connector.

     # Download the source, which contains the Datastore connector
     wget https://github.com/GoogleCloudPlatform/bdutil/archive/1.2.1.tar.gz -O bdutil.tar.gz
     # Extract the source
     tar -xvzf bdutil.tar.gz
     cd bdutil-*/
  • Create a bdutil custom environment variable file. Please refer to the bdutil configuration documentation on how to create a correct configuration file, since you need to specify the project, number of servers, GCS bucket, machine type, etc.
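    As a rough sketch, such an environment file is a set of shell variable assignments. The variable names below are the ones I recall from bdutil's own bdutil_env.sh, and the values are placeholders — check the bdutil_env.sh shipped in the extracted source for the authoritative list:

    ```shell
    # my_env.sh -- illustrative values only; see bdutil_env.sh in the
    # extracted bdutil source for the authoritative variable names.
    PROJECT=my-gcp-project          # your GCP project ID
    CONFIGBUCKET=my-bdutil-bucket   # GCS bucket bdutil uses for staging
    NUM_WORKERS=2                   # number of worker VMs
    GCE_ZONE=us-central1-a          # Compute Engine zone to deploy into
    GCE_MACHINE_TYPE=n1-standard-2  # machine type for each node
    ```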

  • Deploy your Hadoop instances (full documentation) using datastore_env.sh

     ./bdutil deploy -e YOUR_ENV_FILE.sh,datastore_env.sh 
  • Connect to the Hadoop master node

     ./bdutil shell 
  • Now, on the master node, you can run your MapReduce job, which will have access to Datastore as well.
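    Alternatively, from your local machine you can run the sample Datastore MapReduce job mentioned in the question, which bdutil 1.2.1 bundles under samples/; this requires the cluster deployed above to be up:

    ```shell
    # Upload the bundled samples to the cluster and run the Datastore
    # MapReduce sample script (the command from the bdutil docs);
    # requires a deployed cluster, so it cannot run standalone.
    ./bdutil --upload_files "samples/*" run_command ./test-mr-datastore.sh
    ```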

  • Tear down your Hadoop cluster

      ./bdutil delete 

The Datastore connector is indeed deprecated.

To your question "is there an alternative way to create MapReduce tasks from the local command line": one option is to use Google Cloud Dataflow. It's not MapReduce per se, but it's the programming model for parallel data processing that has replaced MapReduce at Google. The Dataflow SDK includes support for Datastore access.

Unlike Hadoop, you don't have to set up a cluster. You just write your code (using the Dataflow SDK) and submit your job from the CLI. The Dataflow service will create the needed workers on the fly to process your job and then terminate them.
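As a hypothetical sketch, submitting a Dataflow 1.x (Java SDK) pipeline from the local command line looked roughly like this. The main class name is made up for illustration, and the exact options depend on your pipeline — see the Cloud Dataflow documentation for the authoritative invocation:

```shell
# Illustrative only: com.example.MyDatastorePipeline is a placeholder
# for your own pipeline class; requires a Maven project using the
# Dataflow SDK and valid GCP credentials, so it cannot run standalone.
mvn compile exec:java \
  -Dexec.mainClass=com.example.MyDatastorePipeline \
  -Dexec.args="--project=YOUR_PROJECT \
               --stagingLocation=gs://YOUR_BUCKET/staging \
               --runner=BlockingDataflowPipelineRunner"
```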
