
BigQuery - How to set read timeout in the Java client library

I am using Spark to load some data into BigQuery. The idea is to read the data from S3 and use Spark and the BigQuery client API to load it. Below is the code that does the insert into BigQuery.

val bq = createAuthorizedClientWithDefaultCredentialsFromStream(appName, credentialStream)
val bqjob = bq.jobs().insert(pid, job, data).execute() // data is an InputStream with the content to load

With this approach, I am seeing a lot of SocketTimeoutException errors.

Caused by: java.net.SocketTimeoutException: Read timed out
at java.net.SocketInputStream.socketRead0(Native Method)
at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
at java.net.SocketInputStream.read(SocketInputStream.java:170)
at java.net.SocketInputStream.read(SocketInputStream.java:141)
at sun.security.ssl.InputRecord.readFully(InputRecord.java:465)
at sun.security.ssl.InputRecord.read(InputRecord.java:503)
at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:954)
at sun.security.ssl.SSLSocketImpl.readDataRecord(SSLSocketImpl.java:911)
at sun.security.ssl.AppInputStream.read(AppInputStream.java:105)
at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
at java.io.BufferedInputStream.read1(BufferedInputStream.java:286)
at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
at sun.net.www.http.HttpClient.parseHTTPHeader(HttpClient.java:703)
at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:647)
at sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1534)
at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1439)
at java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:480)
at sun.net.www.protocol.https.HttpsURLConnectionImpl.getResponseCode(HttpsURLConnectionImpl.java:338)
at com.google.api.client.http.javanet.NetHttpResponse.<init>(NetHttpResponse.java:37)
at com.google.api.client.http.javanet.NetHttpRequest.execute(NetHttpRequest.java:94)
at com.google.api.client.http.HttpRequest.execute(HttpRequest.java:972)
at com.google.api.client.googleapis.media.MediaHttpUploader.executeCurrentRequestWithoutGZip(MediaHttpUploader.java:545)
at com.google.api.client.googleapis.media.MediaHttpUploader.executeCurrentRequest(MediaHttpUploader.java:562)
at com.google.api.client.googleapis.media.MediaHttpUploader.resumableUpload(MediaHttpUploader.java:419)
at com.google.api.client.googleapis.media.MediaHttpUploader.upload(MediaHttpUploader.java:336)
at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:427)
at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:352)
at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.execute(AbstractGoogleClientRequest.java:469)

It looks like the delay in reading from S3 causes the Google HTTP client to time out. I wanted to increase the timeout and tried the options below.

val req = bq.jobs().insert(pid, job, data).buildHttpRequest()
req.setReadTimeout(3 * 60 * 1000)
val res = req.execute()

But this causes a precondition failure in BigQuery. It expects the mediaUploader to be null, though I am not sure why.

Exception in thread "main" java.lang.IllegalArgumentException
    at com.google.api.client.repackaged.com.google.common.base.Preconditions.checkArgument(Preconditions.java:76)
    at com.google.api.client.util.Preconditions.checkArgument(Preconditions.java:37)
    at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.buildHttpRequest(AbstractGoogleClientRequest.java:297)

This led me to try the second insert API on BigQuery:

val req = bq.jobs().insert(pid, job).buildHttpRequest().setReadTimeout(3 * 60 * 1000).setContent(data)
val res = req.execute()

And this time it failed with a different error.

Exception in thread "main" com.google.api.client.googleapis.json.GoogleJsonResponseException: 400 Bad Request
{
  "code" : 400,
  "errors" : [ {
    "domain" : "global",
    "message" : "Job configuration must contain exactly one job-specific configuration object (e.g., query, load, extract, spreadsheetExtract), but there were 0: ",
    "reason" : "invalid"
  } ],
  "message" : "Job configuration must contain exactly one job-specific configuration object (e.g., query, load, extract, spreadsheetExtract), but there were 0: "
}

Please suggest how I can set the timeout, and point out if I am doing something wrong.

I'll answer the main question from the title: how to set timeouts using the Java client library.

To set timeouts, you need a custom HttpRequestInitializer configured in your client. For example:

Bigquery.Builder builder =
    new Bigquery.Builder(new UrlFetchTransport(), new JacksonFactory(), credential);
// Wrap the existing initializer (the credential) so authentication still runs,
// then set the timeouts (in milliseconds) on every request the client builds.
final HttpRequestInitializer existing = builder.getHttpRequestInitializer();
builder.setHttpRequestInitializer(new HttpRequestInitializer() {
    @Override
    public void initialize(HttpRequest request) throws IOException {
      existing.initialize(request);
      request
          .setReadTimeout(READ_TIMEOUT)
          .setConnectTimeout(CONNECTION_TIMEOUT);
    }
});
Bigquery client = builder.build();
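
With that initializer in place, the timeouts should apply to every request the client builds, including the resumable media upload that was timing out in the stack trace above. A rough sketch of issuing the same load-with-media insert through the configured client, assuming pid, job and data are the variables from the question's code:

import com.google.api.client.http.InputStreamContent;
import com.google.api.services.bigquery.model.Job;

// Sketch only: pid, job and data are assumed to come from the question's code.
InputStreamContent media = new InputStreamContent("application/octet-stream", data);
Bigquery.Jobs.Insert insert = client.jobs().insert(pid, job, media);
// Each request of the resumable upload is built through the initializer above,
// so the read/connect timeouts now cover the upload as well.
Job result = insert.execute();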

I don't think this will solve all the issues you are facing. A few ideas that might be helpful, though I don't fully understand the scenario, so these may be off track:

  • If you are moving large files: consider staging them on GCS before loading them into BigQuery (see the load-job sketch after this list).
  • If you are using media upload to send the data with your request: the payload can't be too large, or you risk timeouts or network connection failures.
  • If you are running an embarrassingly parallel data migration and the data chunks are relatively small, bigquery.tabledata.insertAll may be more appropriate for large fan-in scenarios like this (see the streaming sketch after this list). See https://cloud.google.com/bigquery/streaming-data-into-bigquery for more details.
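
To make the first bullet concrete: once the file is staged on GCS, the load job only references a gs:// URI and carries no media upload, so there is nothing for the HTTP client to stream and time out on. A rough sketch with placeholder bucket, dataset and table names (the schema is omitted for brevity; in practice the destination table must already exist or a schema must be supplied):

import com.google.api.services.bigquery.model.Job;
import com.google.api.services.bigquery.model.JobConfiguration;
import com.google.api.services.bigquery.model.JobConfigurationLoad;
import com.google.api.services.bigquery.model.TableReference;
import java.util.Collections;

// Placeholder names: the bucket path, dataset and table ids are illustrative only.
JobConfigurationLoad load = new JobConfigurationLoad()
    .setSourceUris(Collections.singletonList("gs://my-bucket/staged/data.json"))
    .setSourceFormat("NEWLINE_DELIMITED_JSON")
    .setDestinationTable(new TableReference()
        .setProjectId(pid)
        .setDatasetId("my_dataset")
        .setTableId("my_table"));
Job gcsJob = new Job().setConfiguration(new JobConfiguration().setLoad(load));
// Submit the load job; BigQuery reads the data from GCS, the client streams nothing.
Job submitted = client.jobs().insert(pid, gcsJob).execute();

For the streaming path in the last bullet, a minimal sketch with placeholder table and row values:

import com.google.api.services.bigquery.model.TableDataInsertAllRequest;
import com.google.api.services.bigquery.model.TableDataInsertAllResponse;
import java.util.Collections;

// Placeholder row; real code would batch many small rows per request.
TableDataInsertAllRequest.Rows row = new TableDataInsertAllRequest.Rows()
    .setInsertId("row-1") // optional de-duplication id
    .setJson(Collections.<String, Object>singletonMap("column_name", "value"));
TableDataInsertAllRequest body = new TableDataInsertAllRequest()
    .setRows(Collections.singletonList(row));
TableDataInsertAllResponse resp =
    client.tabledata().insertAll(pid, "my_dataset", "my_table", body).execute();
// Per-row failures are reported in the response rather than thrown as exceptions.
if (resp.getInsertErrors() != null) { /* handle per-row errors */ }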

Thanks for the question!
