
GCP Dataflow is unable to access BigQuery dataset in a different GCP project

I have a GCP Dataflow pipeline that reads two datasets from two different GCP projects and compares them.

It works fine when the two datasets are in the same project. However, when I tried to compare two datasets in different projects, I got an error:

{
  "message": "java.lang.RuntimeException: Unable to confirm BigQuery dataset presence for table \"my-other-project:my_dataset_other.2022-07-13_My_BigQuery_Table\". If the dataset is created by an earlier stage of the pipeline, this validation can be disabled using #withoutValidation.",
  "stacktrace": "java.lang.RuntimeException: java.lang.RuntimeException: Unable to confirm BigQuery dataset presence for table \"my-other-project:my_dataset_other.2022-07-13_My_BigQuery_Table\". If the dataset is created by an earlier stage of the pipeline, this validation can be disabled using #withoutValidation.\n\tat org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO$TypedRead.validate(BigQueryIO.java:1018)\n\tat org.apache.beam.sdk.Pipeline$ValidateVisitor.enterCompositeTransform(Pipeline.java:662)\n\tat org.apache.beam.sdk.runners.TransformHierarchy$Node.visit(TransformHierarchy.java:581)\n\tat org.apache.beam.sdk.runners.TransformHierarchy$Node.visit(TransformHierarchy.java:585)\n\tat org.apache.beam.sdk.runners.TransformHierarchy$Node.access$500(TransformHierarchy.java:240)\n\tat org.apache.beam.sdk.runners.TransformHierarchy.visit(TransformHierarchy.java:214)\n\tat org.apache.beam.sdk.Pipeline.traverseTopologically(Pipeline.java:469)\n\tat org.apache.beam.sdk.Pipeline.validate(Pipeline.java:598)\n\tat org.apache.beam.sdk.Pipeline.run(Pipeline.java:322)\n\tat org.apache.beam.sdk.Pipeline.run(Pipeline.java:309)\n\tat ....
  org.springframework.cglib.proxy.MethodProxy.invoke(MethodProxy.java:218)\n\tat org.springframework.aop.framework.CglibAopProxy$CglibMethodInvocation.invokeJoinpoint(CglibAopProxy.java:793)\n\tat org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:163)\n\tat org.springframework.aop.framework.CglibAopProxy$CglibMethodInvocation.proceed(CglibAopProxy.java:763)\n\tat org.springframework.security.access.intercept.aopalliance.MethodSecurityInterceptor.invoke(MethodSecurityInterceptor.java:61)\n\tat org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:186)\n\tat org.springframework.aop.framework.CglibAopProxy$CglibMethodInvocation.proceed(CglibAopProxy.java:763)\n\tat org.springframework.aop.framework.CglibAopProxy$DynamicAdvisedInterceptor.intercept(CglibAopProxy.java:708)\n\tat 
  ........
  
  org.apache.beam.sdk.io.gcp.bigquery.BigQueryHelpers.verifyDatasetPresence(BigQueryHelpers.java:521)\n\t... 116 more\nCaused by: com.google.api.client.googleapis.json.GoogleJsonResponseException: 403 Forbidden\nGET https://bigquery.googleapis.com/bigquery/v2/projects/my-other-project/datasets/my_dataset_other?prettyPrint=false\n{\n  \"code\" : 403,\n  \"errors\" : [ {\n    \"domain\" : \"global\",\n    \"message\" : \"Access Denied: Dataset my-other-project:my_dataset_other: Permission bigquery.datasets.get denied on dataset my-other-project:my_dataset_other (or it may not exist).\",\n    \"reason\" : \"accessDenied\"\n  } ],\n  \"message\" : \"Access Denied: Dataset my-other-project:my_dataset_other: Permission bigquery.datasets.get denied on dataset my-other-project:my_dataset_other (or it may not exist).\",\n  \"status\" : \"PERMISSION_DENIED\"\n}\n\tat com.google.api.client.googleapis.json.GoogleJsonResponseException.from(GoogleJsonResponseException.java:146)\n\tat com.google.api.client.googleapis.services.json.AbstractGoogleJsonClientRequest.newExceptionOnError(AbstractGoogleJsonClientRequest.java:118)\n\tat com.google.api.client.googleapis.services.json.AbstractGoogleJsonClientRequest.newExceptionOnError(AbstractGoogleJsonClientRequest.java:37)\n\tat com.google.api.client.googleapis.services.AbstractGoogleClientRequest$1.interceptResponse(AbstractGoogleClientRequest.java:428)\n\tat com.google.api.client.http.HttpRequest.execute(HttpRequest.java:1111)\n\tat com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:514)\n\tat com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:455)\n\tat com.google.api.client.googleapis.services.AbstractGoogleClientRequest.execute(AbstractGoogleClientRequest.java:565)\n\tat org.apache.beam.sdk.io.gcp.bigquery.BigQueryServicesImpl.executeWithRetries(BigQueryServicesImpl.java:1324)\n\t... 118 more\n"
}
Response headers
 cache-control: no-cache, no-store, max-age=0, must-revalidate 
 connection: close 
 content-type: application/json 
 date: Wed, 20 Jul 2022 13:40:30 GMT 
 expires: 0 
 pragma: no-cache 

This error happens when the Dataflow pipeline, which runs in ProjectA, tries to access data in my-other-project:my_dataset_other. The pipeline runs as the service account my_user@projecta.iam.gserviceaccount.com.

I have given this service account the "BigQuery Data Viewer" role on my-other-project:my_dataset_other.
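For reference, a dataset-level grant like that can be made with the bq CLI by editing the dataset's access entries (a sketch using the names from this question; the grant may equally be done through the console):

bq show --format=prettyjson my-other-project:my_dataset_other > dataset.json
# In dataset.json, append to the "access" array:
#   { "role": "READER", "userByEmail": "my_user@projecta.iam.gserviceaccount.com" }
bq update --source dataset.json my-other-project:my_dataset_other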

EDIT:

The code is something like this:

private PCollection<MyModel> readModelFromBigQuery(Pipeline pipeline, String projectId, String datasetId, String table) {
    var tableReference = new TableReference()
            .setProjectId(projectId)
            .setDatasetId(datasetId)
            .setTableId(table);

    return pipeline
            .apply(BigQueryIO.readTableRows().from(tableReference))
            .apply(MapElements.into(TypeDescriptor.of(MyModel.class)).via(MyModel::fromTableRow));
}



var pCollection1 = readModelFromBigQuery(pipeline, "my-first-project", "my_dataset_first", "2022-07-13_My_BigQuery_Table");
var pCollection2 = readModelFromBigQuery(pipeline, "my-other-project", "my_dataset_other", "2022-07-13_My_BigQuery_Table");

PCollectionList.of(pCollection1).and(pCollection2)
            .apply(new MyTransformation())
            .apply(BigQueryIO.<MyModel>write()
                .to(composeDestinationTableName())
.......
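For what it's worth, the #withoutValidation mentioned in the error message would only disable the upfront dataset-presence check; as far as I can tell, the read itself would still hit the same 403 at run time. A minimal sketch:

// Sketch only: skips BigQueryIO's upfront dataset validation;
// it does not grant any missing permission.
pipeline.apply(BigQueryIO.readTableRows()
        .from(tableReference)
        .withoutValidation());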

Can someone tell me what went wrong?

The solution is easier than expected! My eye fell on the 403 Forbidden error today: you are missing the bigquery.datasets.get permission.

Of course, to read the data you only need to be BigQuery Data Viewer on the dataset. But the Beam connector apparently first lists the datasets, and only then queries the data in them. So you have to grant the ability to list datasets at the project level.

It's bad news for the least-privilege principle, but you simply have to grant your service account at the project level. To limit the scope of the permissions, you can grant the role roles/bigquery.metadataViewer at the project level. It's not too broad, and not dangerous at all (better than Data Viewer at project scope).
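A minimal sketch of that grant with gcloud, using the identifiers from the question:

# Grant project-level metadata viewing (includes bigquery.datasets.get)
# to the Dataflow worker service account.
gcloud projects add-iam-policy-binding my-other-project \
    --member="serviceAccount:my_user@projecta.iam.gserviceaccount.com" \
    --role="roles/bigquery.metadataViewer"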

