GCP Dataflow is unable to access BigQuery dataset in a different GCP project
I have a GCP Dataflow pipeline which reads two datasets from two different GCP projects and compares them.
It works fine for two datasets in the same project. However, when I tried to compare two datasets in different projects, I got this error:
{
"message": "java.lang.RuntimeException: Unable to confirm BigQuery dataset presence for table \"my-other-project:my_dataset_other.2022-07-13_My_BigQuery_Table\". If the dataset is created by an earlier stage of the pipeline, this validation can be disabled using #withoutValidation.",
"stacktrace": "java.lang.RuntimeException: java.lang.RuntimeException: Unable to confirm BigQuery dataset presence for table \"my-other-project:my_dataset_other.2022-07-13_My_BigQuery_Table\". If the dataset is created by an earlier stage of the pipeline, this validation can be disabled using #withoutValidation.\n\tat org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO$TypedRead.validate(BigQueryIO.java:1018)\n\tat org.apache.beam.sdk.Pipeline$ValidateVisitor.enterCompositeTransform(Pipeline.java:662)\n\tat org.apache.beam.sdk.runners.TransformHierarchy$Node.visit(TransformHierarchy.java:581)\n\tat org.apache.beam.sdk.runners.TransformHierarchy$Node.visit(TransformHierarchy.java:585)\n\tat org.apache.beam.sdk.runners.TransformHierarchy$Node.access$500(TransformHierarchy.java:240)\n\tat org.apache.beam.sdk.runners.TransformHierarchy.visit(TransformHierarchy.java:214)\n\tat org.apache.beam.sdk.Pipeline.traverseTopologically(Pipeline.java:469)\n\tat org.apache.beam.sdk.Pipeline.validate(Pipeline.java:598)\n\tat org.apache.beam.sdk.Pipeline.run(Pipeline.java:322)\n\tat org.apache.beam.sdk.Pipeline.run(Pipeline.java:309)\n\tat ....
org.springframework.cglib.proxy.MethodProxy.invoke(MethodProxy.java:218)\n\tat org.springframework.aop.framework.CglibAopProxy$CglibMethodInvocation.invokeJoinpoint(CglibAopProxy.java:793)\n\tat org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:163)\n\tat org.springframework.aop.framework.CglibAopProxy$CglibMethodInvocation.proceed(CglibAopProxy.java:763)\n\tat org.springframework.security.access.intercept.aopalliance.MethodSecurityInterceptor.invoke(MethodSecurityInterceptor.java:61)\n\tat org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:186)\n\tat org.springframework.aop.framework.CglibAopProxy$CglibMethodInvocation.proceed(CglibAopProxy.java:763)\n\tat org.springframework.aop.framework.CglibAopProxy$DynamicAdvisedInterceptor.intercept(CglibAopProxy.java:708)\n\tat
........
org.apache.beam.sdk.io.gcp.bigquery.BigQueryHelpers.verifyDatasetPresence(BigQueryHelpers.java:521)\n\t... 116 more\nCaused by: com.google.api.client.googleapis.json.GoogleJsonResponseException: 403 Forbidden\nGET https://bigquery.googleapis.com/bigquery/v2/projects/my-other-project/datasets/my_dataset_other?prettyPrint=false\n{\n \"code\" : 403,\n \"errors\" : [ {\n \"domain\" : \"global\",\n \"message\" : \"Access Denied: Dataset my-other-project:my_dataset_other: Permission bigquery.datasets.get denied on dataset my-other-project:my_dataset_other (or it may not exist).\",\n \"reason\" : \"accessDenied\"\n } ],\n \"message\" : \"Access Denied: Dataset my-other-project:my_dataset_other: Permission bigquery.datasets.get denied on dataset my-other-project:my_dataset_other (or it may not exist).\",\n \"status\" : \"PERMISSION_DENIED\"\n}\n\tat com.google.api.client.googleapis.json.GoogleJsonResponseException.from(GoogleJsonResponseException.java:146)\n\tat com.google.api.client.googleapis.services.json.AbstractGoogleJsonClientRequest.newExceptionOnError(AbstractGoogleJsonClientRequest.java:118)\n\tat com.google.api.client.googleapis.services.json.AbstractGoogleJsonClientRequest.newExceptionOnError(AbstractGoogleJsonClientRequest.java:37)\n\tat com.google.api.client.googleapis.services.AbstractGoogleClientRequest$1.interceptResponse(AbstractGoogleClientRequest.java:428)\n\tat com.google.api.client.http.HttpRequest.execute(HttpRequest.java:1111)\n\tat com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:514)\n\tat com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:455)\n\tat com.google.api.client.googleapis.services.AbstractGoogleClientRequest.execute(AbstractGoogleClientRequest.java:565)\n\tat org.apache.beam.sdk.io.gcp.bigquery.BigQueryServicesImpl.executeWithRetries(BigQueryServicesImpl.java:1324)\n\t... 118 more\n"
}
Response headers
cache-control: no-cache, no-store, max-age=0, must-revalidate
connection: close
content-type: application/json
date: Wed, 20 Jul 2022 13:40:30 GMT
expires: 0
pragma: no-cache
This error happens when the Dataflow pipeline, which runs in ProjectA, tries to access data in my-other-project:my_dataset_other. The Dataflow job runs as the service account my_user@projecta.iam.gserviceaccount.com.
I have granted this service account the "BigQuery Data Viewer" role on my-other-project:my_dataset_other.
EDIT:
The code looks something like this:
private PCollection<MyModel> readModelFromBigQuery(Pipeline pipeline, String projectId, String datasetId, String table) {
    var tableReference = new TableReference()
            .setProjectId(projectId)
            .setDatasetId(datasetId)
            .setTableId(table);
    return pipeline
            .apply(BigQueryIO.readTableRows().from(tableReference))
            .apply(MapElements.into(TypeDescriptor.of(MyModel.class)).via(MyModel::fromTableRow));
}
var pCollection1 = readModelFromBigQuery(pipeline, "my-first-project", "my_dataset_first", "2022-07-13_My_BigQuery_Table");
var pCollection2 = readModelFromBigQuery(pipeline, "my-other-project", "my_dataset_other", "2022-07-13_My_BigQuery_Table");

PCollectionList.of(pCollection1).and(pCollection2)
        .apply(new MyTransformation())
        .apply(BigQueryIO.<MyModel>write()
                .to(composeDestinationTableName())
                .......
Can someone tell me what went wrong?
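For reference, the error message itself points at #withoutValidation. A hedged sketch of the same helper with that option applied (this only skips the pre-flight bigquery.datasets.get check at pipeline construction; the worker service account still needs read access to the table at execution time, so it is a workaround for the validation step, not for missing permissions):

```java
// Sketch only: same read as above, but with Beam's dataset-presence
// validation disabled via withoutValidation(). The method name
// readModelFromBigQueryNoValidation is made up for this example.
private PCollection<MyModel> readModelFromBigQueryNoValidation(
        Pipeline pipeline, String projectId, String datasetId, String table) {
    var tableReference = new TableReference()
            .setProjectId(projectId)
            .setDatasetId(datasetId)
            .setTableId(table);
    return pipeline
            .apply(BigQueryIO.readTableRows()
                    .from(tableReference)
                    .withoutValidation())   // skip the datasets.get pre-flight call
            .apply(MapElements.into(TypeDescriptor.of(MyModel.class))
                    .via(MyModel::fromTableRow));
}
```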
The solution is easier than expected! My eye fell on the 403 Forbidden error today: you are missing the bigquery.datasets.get permission.
To read the data itself, BigQuery Data Viewer on the dataset is enough. But the Beam connector first checks that the dataset exists (a datasets.get call) before querying the data in it, so you also have to grant permission to read dataset metadata at the project level.
That's bad news for the least-privilege principle, but the fix is simply a project-level grant for your service account. To limit the scope of the permissions, grant the role roles/bigquery.metadataViewer at the project level. It's not too wide, and not dangerous at all (better than Data Viewer at project scope).
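Using the project and service-account names from the question (substitute your own), the grant would look something like this with the gcloud CLI:

```shell
# Grant the Dataflow worker service account the BigQuery Metadata Viewer
# role on the other project, so bigquery.datasets.get succeeds there.
gcloud projects add-iam-policy-binding my-other-project \
    --member="serviceAccount:my_user@projecta.iam.gserviceaccount.com" \
    --role="roles/bigquery.metadataViewer"
```

This keeps dataset-level BigQuery Data Viewer for the actual data access and adds only metadata visibility at project scope.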