Load Google Datastore Backups from Data Storage to Google BigQuery

Our requirement is to programmatically back up Google Datastore and load these backups into Google BigQuery for further analysis. We were successful in automating backups using the following approach:

        /*
         * Uses the App Engine Task Queue API
         * (com.google.appengine.api.taskqueue). TimeUtils and
         * getEntityNamesForBackup() are helper methods defined elsewhere in
         * our code.
         */
        Queue queue = QueueFactory.getQueue("datastoreBackupQueue");

        /*
         * Create a task which is equivalent to the backup URL mentioned in
         * above cron.xml, using new queue which has Datastore admin enabled
         */
        TaskOptions taskOptions = TaskOptions.Builder.withUrl("/_ah/datastore_admin/backup.create")
                .method(TaskOptions.Method.GET).param("name", "").param("filesystem", "gs")
                .param("gs_bucket_name",
                        "db-backup" + "/" + TimeUtils.parseDateToString(new Date(), "yyyy/MMM/dd"))
                .param("queue", queue.getQueueName());

        /*
         * Get list of dynamic entity kind names from the datastore based on
         * the kinds present in the datastore at the start of backup
         */
        List<String> entityNames = getEntityNamesForBackup();
        for (String entityName : entityNames) {
            taskOptions.param("kind", entityName);
        }

        /* Add this task to above queue */
        queue.add(taskOptions);
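The cron.xml referenced in the comment above is not shown here; a minimal sketch of what it could look like on App Engine (the handler URL and schedule are illustrative assumptions) is:

        <?xml version="1.0" encoding="UTF-8"?>
        <cronentries>
            <cron>
                <!-- Hypothetical handler that runs the enqueueing code above -->
                <url>/cron/createBackup</url>
                <description>Daily Datastore backup to Cloud Storage</description>
                <schedule>every day 02:00</schedule>
            </cron>
        </cronentries>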

I was then able to import these backups into Google BigQuery manually, but how do we automate this process?

I have also looked through most of the docs and nothing helped: https://cloud.google.com/bigquery/docs/loading-data-cloud-storage#loading_data_from_google_cloud_storage

I have solved this myself. Here is the solution in Java: the following code picks up the backup files from Google Cloud Storage and loads them into Google BigQuery.

        /*
         * HTTP_TRANSPORT, JSON_FACTORY, bucket, prefix, projectId and
         * datasetId are fields initialized elsewhere in our controller.
         */
        AppIdentityCredential bqCredential = new AppIdentityCredential(
                Collections.singleton(BigqueryScopes.BIGQUERY));

        AppIdentityCredential dsCredential = new AppIdentityCredential(
                Collections.singleton(StorageScopes.CLOUD_PLATFORM));

        /* List the backup objects under the given bucket/prefix */
        Storage storage = new Storage(HTTP_TRANSPORT, JSON_FACTORY, dsCredential);
        Objects list = storage.objects().list(bucket).setPrefix(prefix).setFields("items/name").execute();

        if (list == null || list.getItems() == null) {
            Log.severe(BackupDBController.class, "BackupToBigQueryController",
                    "List from Google Cloud Storage was null", null);
        } else if (list.getItems().isEmpty()) {
            Log.severe(BackupDBController.class, "BackupToBigQueryController",
                    "List from Google Cloud Storage was empty", null);
        } else {

            for (String kind : getEntityNamesForBackup()) {
                Job job = new Job();
                JobConfiguration config = new JobConfiguration();
                JobConfigurationLoad loadConfig = new JobConfigurationLoad();

                /* Find the .backup_info metadata file for this kind */
                String url = "";
                for (StorageObject obj : list.getItems()) {
                    String currentUrl = obj.getName();
                    if (currentUrl.contains(kind + ".backup_info")) {
                        url = currentUrl;
                        break;
                    }
                }

                if (StringUtils.isStringEmpty(url)) {
                    continue;
                } else {
                    url = "gs://"+bucket+"/" + url;
                }

                List<String> gsUrls = new ArrayList<>();
                gsUrls.add(url);

                loadConfig.setSourceUris(gsUrls);
                loadConfig.set("sourceFormat", "DATASTORE_BACKUP");
                loadConfig.set("allowQuotedNewlines", true);

                TableReference table = new TableReference();
                table.setProjectId(projectId);
                table.setDatasetId(datasetId);
                table.setTableId(kind);
                loadConfig.setDestinationTable(table);

                config.setLoad(loadConfig);
                job.setConfiguration(config);

                Bigquery bigquery = new Bigquery.Builder(HTTP_TRANSPORT, JSON_FACTORY, bqCredential)
                        .setApplicationName("BigQuery-Service-Accounts/0.1").setHttpRequestInitializer(bqCredential)
                        .build();
                Insert insert = bigquery.jobs().insert(projectId, job);

                /* insert() only submits the job; the load runs asynchronously */
                JobReference jr = insert.execute().getJobReference();
                Log.info(BackupDBController.class, "BackupToBigQueryController",
                        "Load job " + jr.getJobId() + " submitted to BigQuery", null);
            }
        }
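Note that jobs().insert() only queues the load job. If you want to verify that the load actually finished before logging success, you can poll the job status right after submitting it, as in this sketch (to be placed inside the loop above; error handling kept minimal):

        try {
            /* Poll the submitted load job until BigQuery reports it DONE */
            Job poll = bigquery.jobs().get(projectId, jr.getJobId()).execute();
            while (!"DONE".equals(poll.getStatus().getState())) {
                Thread.sleep(5000); // wait 5 seconds between polls
                poll = bigquery.jobs().get(projectId, jr.getJobId()).execute();
            }
            if (poll.getStatus().getErrorResult() != null) {
                /* The job finished but the load failed; log the reason */
                Log.severe(BackupDBController.class, "BackupToBigQueryController",
                        poll.getStatus().getErrorResult().getMessage(), null);
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }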

If anyone has a better approach, please let me know.

In the loading data from Google Cloud Storage article that you mentioned in your question, some programmatic examples of importing from GCS are described, using the command line, Node.js or Python.

You can also automate importing data located on Cloud Storage into BigQuery by running the following command in your script:

$ gcloud alpha bigquery import SOURCE DESTINATION_TABLE
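For example, with placeholder bucket, backup file and table names (check the linked article for the exact SOURCE and DESTINATION_TABLE formats):

$ gcloud alpha bigquery import gs://db-backup/2016/Jan/01/MyKind.backup_info mydataset/MyKind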

For more information on this command, visit this article.

As of last week there's a proper way to automate this. The most important part is gcloud beta datastore export.
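Roughly, the flow looks like this, with placeholder bucket, kind and dataset names (the export writes a .export_metadata file whose exact path prefix depends on the export; the script below wraps these steps):

# Export one or more kinds from Datastore to Cloud Storage
$ gcloud beta datastore export gs://db-backup --kinds=MyKind

# Load the resulting export metadata file into BigQuery
$ bq load --source_format=DATASTORE_BACKUP mydataset.MyKind \
    gs://db-backup/<EXPORT_PREFIX>/all_namespaces/kind_MyKind/all_namespaces_kind_MyKind.export_metadata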

I created a short script around that: https://github.com/chees/datastore2bigquery

You can adjust that to fit your situation.
