
BigQuery - streaming via java is very slow

I'm attempting to stream data from a Kafka installation into BigQuery using Java, based on Google samples. The data is JSON rows, each ~12K in length. I batch these into blocks of 500 (roughly 6 MB) and stream them as:

import com.fasterxml.jackson.core.type.TypeReference;
import com.fasterxml.jackson.databind.ObjectMapper;
import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.InsertAllRequest;
import com.google.cloud.bigquery.InsertAllResponse;
import java.util.Map;
import java.util.Objects;

InsertAllRequest.Builder builder = InsertAllRequest.newBuilder(tableId);

for (String record : bqStreamingPacket.getRecords()) {
    // clean up stray "{," artifacts before parsing the JSON row
    Map<String, Object> mapObject = objectMapper.readValue(
            record.replaceAll("\\{,", "{"),
            new TypeReference<Map<String, Object>>() {});

    // remove nulls
    mapObject.values().removeIf(Objects::isNull);

    // create an id for each row - used by BigQuery for best-effort de-duplication on retry
    builder.addRow(String.valueOf(System.nanoTime()), mapObject);
}

insertAllRequest = builder.build();

...


BigQueryOptions bigQueryOptions = BigQueryOptions.newBuilder()
    .setCredentials(Credentials.getAppCredentials())
    .build();

BigQuery bigQuery = bigQueryOptions.getService();

InsertAllResponse insertAllResponse = bigQuery.insertAll(insertAllRequest);

I'm seeing insert times of 3-5 seconds for each call. Needless to say, this makes BQ streaming less than useful. From their documentation I was worried about hitting per-table insert quotas (I'm streaming from Kafka at ~1M rows/min), but now I'd be happy to be dealing with that problem.

All rows insert fine. No errors.

I must be doing something very wrong with this setup. Please advise.

We measure between 1200-2500 ms for each streaming request, and this has been consistent over the last three years, as you can see in the chart (we stream from Softlayer to Google).

[chart: streaming request latency over the last three years]

Try varying the batch size from hundreds to thousands of rows, or until you reach some streaming API limit, and measure each call.
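A minimal sketch of that measurement loop: the stub `Consumer` below stands in for a real `bigQuery.insertAll(...)` call (swap it in when running against your project), and the JSON rows are synthetic placeholders.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

public class InsertBenchmark {

    // Time one insert call for a batch of the given size; return rows/second.
    static double rowsPerSecond(int batchSize, Consumer<List<String>> insert) {
        List<String> batch = new ArrayList<>();
        for (int i = 0; i < batchSize; i++) {
            batch.add("{\"row\":" + i + "}"); // synthetic row
        }
        long start = System.nanoTime();
        insert.accept(batch);                 // replace stub with bigQuery.insertAll(...)
        double seconds = (System.nanoTime() - start) / 1e9;
        return batchSize / seconds;
    }

    public static void main(String[] args) {
        // Sweep batch sizes and report throughput for each call.
        for (int size : new int[]{100, 500, 1000, 5000}) {
            double rps = rowsPerSecond(size, batch -> { /* stub insert */ });
            System.out.printf("batch=%d  rows/sec=%.0f%n", size, rps);
        }
    }
}
```

Plotting rows/second against batch size shows where per-call overhead (network round trip, SSL handshake) stops dominating and the payload itself becomes the bottleneck.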

Based on this you can deduce more information, such as a bandwidth problem between you and the BigQuery API, latency, or SSL handshake overhead, and eventually optimize for your environment.

You can also leave your project id/table here, and maybe a Google engineer will check it.
