
Reading data from Google Cloud BigQuery

I am new to the pipeline world and the Google Dataflow API.

I want to read data from BigQuery with a SQL query. When I read the entire table, it works fine.

PipelineOptions options = PipelineOptionsFactory.create();
Pipeline p = Pipeline.create(options);
PCollection<TableRow> qData = p.apply(
     BigQueryIO.Read
         .named("Read")
         .from("test:DataSetTest.data"));

But when I use fromQuery, I get an error.

PipelineOptions options = PipelineOptionsFactory.create();
Pipeline p = Pipeline.create(options);
PCollection<TableRow> qData = p.apply(
     BigQueryIO.Read
         .named("Read")
         .fromQuery("SELECT * FROM DataSetTest.data"));

Error:

Exception in thread "main" java.lang.IllegalArgumentException: Validation of query "SELECT * FROM DataSetTest.data" failed. If the query depends on an earlier stage of the pipeline, this validation can be disabled using #withoutValidation.
    at com.google.cloud.dataflow.sdk.io.BigQueryIO$Read$Bound.dryRunQuery(BigQueryIO.java:449)
    at com.google.cloud.dataflow.sdk.io.BigQueryIO$Read$Bound.validate(BigQueryIO.java:432)
    at com.google.cloud.dataflow.sdk.Pipeline.applyInternal(Pipeline.java:357)
    at com.google.cloud.dataflow.sdk.Pipeline.applyTransform(Pipeline.java:267)
    at com.google.cloud.dataflow.sdk.values.PBegin.apply(PBegin.java:47)
    at com.google.cloud.dataflow.sdk.Pipeline.apply(Pipeline.java:151)
    at Test.java.packageid.StarterPipeline.main(StarterPipeline.java:72)
Caused by: java.lang.NullPointerException: Required parameter projectId must be specified.
    at com.google.api.client.repackaged.com.google.common.base.Preconditions.checkNotNull(Preconditions.java:229)
    at com.google.api.client.util.Preconditions.checkNotNull(Preconditions.java:140)
    at com.google.api.services.bigquery.Bigquery$Jobs$Query.<init>(Bigquery.java:1751)
    at com.google.api.services.bigquery.Bigquery$Jobs.query(Bigquery.java:1724)
    at com.google.cloud.dataflow.sdk.io.BigQueryIO$Read$Bound.dryRunQuery(BigQueryIO.java:445)
    ... 6 more

What is the problem here?

UPDATE:

I set the project with "options.setProject".

PipelineOptions options = PipelineOptionsFactory.create();
Pipeline p = Pipeline.create(options);
options.setProject("test");
PCollection<TableRow> qData = p.apply(
     BigQueryIO.Read
         .named("Read")
         .fromQuery("SELECT * FROM DataSetTest.data"));

But now I get this message: the table is not found.

Caused by: com.google.api.client.googleapis.json.GoogleJsonResponseException: 404 Not Found
{
  "code" : 404,
  "errors" : [ {
    "domain" : "global",
    "message" : "Not found: Table test:_dataflow_temporary_dataset_737099.dataflow_temporary_table_550832",
    "reason" : "notFound"
  } ],
  "message" : "Not found: Table test:_dataflow_temporary_dataset_737099.dataflow_temporary_table_550832"
}

All resources in Google Cloud Platform, including BigQuery tables and Dataflow jobs, are associated with a cloud project. Specifying the project is necessary when interacting with GCP resources.

The exception trace is saying that no cloud project is set for the BigQueryIO.Read transform: Caused by: java.lang.NullPointerException: Required parameter projectId must be specified.

Dataflow controls the default value of the cloud project via its PipelineOptions API. Dataflow will default to using this project across its APIs, including BigQueryIO.

Normally, we recommend constructing the PipelineOptions from command-line arguments using the PipelineOptionsFactory.fromArgs(String) API. In this case, you'd just pass --project=YOUR_PROJECT on the command line.
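For example, a minimal sketch (assuming args are the arguments passed to main):

// Parses --project=YOUR_PROJECT, along with any other flags, from the program arguments.
PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
Pipeline p = Pipeline.create(options);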

Alternatively, this can be set manually in code, as follows:

GcpOptions gcpOptions = options.as(GcpOptions.class);
gcpOptions.setProject("YOUR_PROJECT");
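
Putting this together with the code from the question, here is a sketch of a complete pipeline that sets the project before applying the read; YOUR_PROJECT is a placeholder for your actual cloud project ID, and the table reference is taken from the question:

import com.google.api.services.bigquery.model.TableRow;
import com.google.cloud.dataflow.sdk.Pipeline;
import com.google.cloud.dataflow.sdk.io.BigQueryIO;
import com.google.cloud.dataflow.sdk.options.GcpOptions;
import com.google.cloud.dataflow.sdk.options.PipelineOptions;
import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;
import com.google.cloud.dataflow.sdk.values.PCollection;

public class StarterPipeline {
  public static void main(String[] args) {
    PipelineOptions options = PipelineOptionsFactory.create();
    // Set the cloud project before creating the pipeline, so that
    // BigQueryIO's dry-run validation of the query has a projectId.
    options.as(GcpOptions.class).setProject("YOUR_PROJECT");
    Pipeline p = Pipeline.create(options);
    PCollection<TableRow> qData = p.apply(
        BigQueryIO.Read
            .named("Read")
            .fromQuery("SELECT * FROM DataSetTest.data"));
    p.run();
  }
}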

Finally, starting with version 1.4.0 of the Dataflow SDK for Java, Dataflow will default to using the cloud project set via gcloud config set project <project>. You can still override it via PipelineOptions, but you don't need to. This may have worked in some scenarios even before version 1.4.0, but it may not have been reliable across all scenarios or combinations of Cloud SDK and Dataflow SDK versions.
