Reading data from Google Cloud BigQuery
I am new to the pipeline world and the Google Dataflow API. I want to read data from BigQuery with a SQL query. When I read the entire table, it works fine.
PipelineOptions options = PipelineOptionsFactory.create();
Pipeline p = Pipeline.create(options);
PCollection<TableRow> qData = p.apply(
BigQueryIO.Read
.named("Read")
.from("test:DataSetTest.data"));
But when I use fromQuery, I get an error.
PipelineOptions options = PipelineOptionsFactory.create();
Pipeline p = Pipeline.create(options);
PCollection<TableRow> qData = p.apply(
BigQueryIO.Read
.named("Read")
.fromQuery("SELECT * FROM DataSetTest.data"));
Error:

Exception in thread "main" java.lang.IllegalArgumentException: Validation of query "SELECT * FROM DataSetTest.data" failed. If the query depends on an earlier stage of the pipeline, This validation can be disabled using #withoutValidation.
at com.google.cloud.dataflow.sdk.io.BigQueryIO$Read$Bound.dryRunQuery(BigQueryIO.java:449)
at com.google.cloud.dataflow.sdk.io.BigQueryIO$Read$Bound.validate(BigQueryIO.java:432)
at com.google.cloud.dataflow.sdk.Pipeline.applyInternal(Pipeline.java:357)
at com.google.cloud.dataflow.sdk.Pipeline.applyTransform(Pipeline.java:267)
at com.google.cloud.dataflow.sdk.values.PBegin.apply(PBegin.java:47)
at com.google.cloud.dataflow.sdk.Pipeline.apply(Pipeline.java:151)
at Test.java.packageid.StarterPipeline.main(StarterPipeline.java:72)
Caused by: java.lang.NullPointerException: Required parameter projectId must be specified.
at com.google.api.client.repackaged.com.google.common.base.Preconditions.checkNotNull(Preconditions.java:229)
at com.google.api.client.util.Preconditions.checkNotNull(Preconditions.java:140)
at com.google.api.services.bigquery.Bigquery$Jobs$Query.<init>(Bigquery.java:1751)
at com.google.api.services.bigquery.Bigquery$Jobs.query(Bigquery.java:1724)
at com.google.cloud.dataflow.sdk.io.BigQueryIO$Read$Bound.dryRunQuery(BigQueryIO.java:445)
... 6 more
What is the problem here?
UPDATE:
I set the project via options.setProject.
PipelineOptions options = PipelineOptionsFactory.create();
Pipeline p = Pipeline.create(options);
options.setProject("test");
PCollection<TableRow> qData = p.apply(
BigQueryIO.Read
.named("Read")
.fromQuery("SELECT * FROM DataSetTest.data"));
But now I get this message: the table is not found.
Caused by: com.google.api.client.googleapis.json.GoogleJsonResponseException: 404 Not Found { "code" : 404, "errors" : [ { "domain" : "global", "message" : "Not found: Table test:_dataflow_temporary_dataset_737099.dataflow_temporary_table_550832", "reason" : "notFound" } ], "message" : "Not found: Table test:_dataflow_temporary_dataset_737099.dataflow_temporary_table_550832" }
All resources in Google Cloud Platform, including BigQuery tables and Dataflow jobs, are associated with a cloud project. Specifying the project is necessary when interacting with GCP resources.
The exception trace is saying that no cloud project is set for the BigQueryIO.Read transform: Caused by: java.lang.NullPointerException: Required parameter projectId must be specified.
Dataflow controls the default value of the cloud project via its PipelineOptions API, and it will default to using that project across its APIs, including BigQueryIO.
Normally, we recommend constructing the PipelineOptions from command-line arguments using the PipelineOptionsFactory.fromArgs(String) API. In this case, you'd just pass --project=YOUR_PROJECT on the command line.
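To make that concrete, here is a minimal sketch of the command-line approach against the 1.x Dataflow SDK for Java; the class name reuses the question's StarterPipeline, and the query is the one from the question:

```java
import com.google.api.services.bigquery.model.TableRow;
import com.google.cloud.dataflow.sdk.Pipeline;
import com.google.cloud.dataflow.sdk.io.BigQueryIO;
import com.google.cloud.dataflow.sdk.options.PipelineOptions;
import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;
import com.google.cloud.dataflow.sdk.values.PCollection;

public class StarterPipeline {
  public static void main(String[] args) {
    // Parses flags such as --project=YOUR_PROJECT into the options object,
    // so the project is set before any transform is applied.
    PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
    Pipeline p = Pipeline.create(options);

    // Query validation now runs against the project supplied on the command line.
    PCollection<TableRow> qData = p.apply(
        BigQueryIO.Read
            .named("Read")
            .fromQuery("SELECT * FROM DataSetTest.data"));

    p.run();
  }
}
```

Run it with --project=YOUR_PROJECT appended to the program arguments.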
Alternatively, this can be set manually in code, as follows:
GcpOptions gcpOptions = options.as(GcpOptions.class);
gcpOptions.setProject("YOUR_PROJECT");
Finally, starting with version 1.4.0 of the Dataflow SDK for Java, Dataflow will default to using the cloud project set via gcloud config set project <project>. You can still override it via PipelineOptions, but you don't need to. This may have worked in some scenarios even before version 1.4.0, but may not have been reliable in all scenarios or combinations of versions of the Cloud SDK and Dataflow SDK.
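If it is unclear which project a pipeline will end up using, one way to check is to inspect the resolved GcpOptions value before constructing the pipeline. This is a small diagnostic sketch, not part of the original answer; it assumes the same 1.x SDK, where the default project is resolved from the gcloud configuration when no flag is given:

```java
import com.google.cloud.dataflow.sdk.options.GcpOptions;
import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;

public class ShowProject {
  public static void main(String[] args) {
    // View the parsed options through the GcpOptions interface.
    GcpOptions gcpOptions =
        PipelineOptionsFactory.fromArgs(args).as(GcpOptions.class);
    // Prints the project taken from --project, or the gcloud default if unset.
    System.out.println("Effective project: " + gcpOptions.getProject());
  }
}
```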