
Is Datastore MapReduce deprecated?

I have just installed the Google Cloud Platform free trial. In order to run MapReduce tasks with Datastore, the docs say to run

./bdutil --upload_files "samples/*" run_command ./test-mr-datastore.sh

But I couldn't find this file on my local machine, and there seems to be a good reason for that: this way of running MapReduce jobs appears to be deprecated (see this on GitHub). Is that true? Is there an alternative way to create MapReduce tasks from the local command line without requiring BigQuery?

NOTE: The Google team removed the Datastore connector from bdutil in v1.3.0 (2015-05-27), so you might need to use an older version, or use GCS or BigQuery as a proxy to access your data in Datastore.

I'll try to cover as much as I can, but bdutil requires a lot more detail than is practical to document in this answer. I hope this is enough to get you started:

  • Set up the Google Cloud SDK - detail

     # Download SDK
     curl https://sdk.cloud.google.com | bash
     # Restart your shell
     exec -l $SHELL
     # Authenticate to GCP
     gcloud auth login
     # Select Project
     gcloud config set project PROJECT_NAME
  • Download and extract the bdutil source code that contains the Datastore connector.

     # Download source which contains DataStore connector
     wget https://github.com/GoogleCloudPlatform/bdutil/archive/1.2.1.tar.gz -O bdutil.tar.gz
     # Extract source
     tar -xvzf bdutil.tar.gz
     cd bdutil-*/
  • Create a custom bdutil environment variable file. Please refer to the bdutil configuration documentation about creating a correct configuration file, since you need to specify the project, number of servers, GCS bucket, machine type, etc.
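     As a minimal sketch of what such a file might look like (variable names follow bdutil's default bdutil_env.sh; all values here are placeholders you must replace):

     ```shell
     # my_env.sh -- minimal custom bdutil environment file (values are placeholders)
     PROJECT='your-project-id'          # GCP project to deploy into
     CONFIGBUCKET='your-gcs-bucket'     # GCS bucket used for setup/config files
     NUM_WORKERS=2                      # number of worker VMs
     GCE_ZONE='us-central1-a'           # zone for the instances
     GCE_MACHINE_TYPE='n1-standard-2'   # machine type for each node
     ```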

  • Deploy your Hadoop instances (full documentation) using datastore_env.sh

     ./bdutil deploy -e YOUR_ENV_FILE.sh,datastore_env.sh 
  • Connect to the Hadoop master node

     ./bdutil shell 
  • Now, on the master node, you can run your MapReduce job, which will have access to Datastore as well.
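     For instance, a job packaged as a jar could be launched from the master node like this (the jar name, class name, and paths are hypothetical, not from the original docs):

     ```shell
     # Launch a MapReduce job from the Hadoop master node.
     # your-job.jar and com.example.YourJob are placeholder names.
     hadoop jar your-job.jar com.example.YourJob \
         gs://YOUR_BUCKET/input \
         gs://YOUR_BUCKET/output
     ```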

  • Tear down your Hadoop cluster

      ./bdutil delete 

The Datastore connector is indeed deprecated.

To your question "is there an alternative way to create MapReduce tasks from the local command line": one option is to use Google Cloud Dataflow. It's not MapReduce per se, but it is the programming model for parallel data processing that has replaced MapReduce at Google. The Dataflow SDK includes support for Datastore access.

Unlike Hadoop, you don't have to set up a cluster. You just write your code (using the Dataflow SDK) and submit your job from the CLI. The Dataflow service will create the needed workers on the fly to process your job and then terminate them.
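As a sketch of what job submission looked like with the Dataflow SDK for Java of that era (the pipeline class and bucket names below are placeholders, not from the original), a job could be run straight from the local command line with Maven:

```shell
# Submit a Dataflow job from the local command line.
# com.example.MyDatastorePipeline and the GCS bucket are hypothetical names.
mvn compile exec:java \
    -Dexec.mainClass=com.example.MyDatastorePipeline \
    -Dexec.args="--project=YOUR_PROJECT \
                 --stagingLocation=gs://YOUR_BUCKET/staging \
                 --runner=BlockingDataflowPipelineRunner"
```

With BlockingDataflowPipelineRunner, the command waits until the job finishes on the Dataflow service; workers are provisioned and torn down automatically.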
