
BigQuery Storage Read API with Limit and Offset

How can I apply LIMIT and OFFSET when selecting data with the BigQuery Storage Read API?

Below is a sample in which I am trying to read data from a BigQuery table. It fetches the entire table, and I can provide filters based on column values, but I want to apply LIMIT and OFFSET and provide custom SQL for the data fetch/read. Is that possible with the Storage API?

import com.google.api.gax.rpc.ServerStream;
import com.google.cloud.bigquery.storage.v1.AvroRows;
import com.google.cloud.bigquery.storage.v1.BigQueryReadClient;
import com.google.cloud.bigquery.storage.v1.CreateReadSessionRequest;
import com.google.cloud.bigquery.storage.v1.DataFormat;
import com.google.cloud.bigquery.storage.v1.ReadRowsRequest;
import com.google.cloud.bigquery.storage.v1.ReadRowsResponse;
import com.google.cloud.bigquery.storage.v1.ReadSession;
import com.google.cloud.bigquery.storage.v1.ReadSession.TableModifiers;
import com.google.cloud.bigquery.storage.v1.ReadSession.TableReadOptions;
import com.google.common.base.Preconditions;
import com.google.protobuf.Timestamp;
import java.io.IOException;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryDecoder;
import org.apache.avro.io.DatumReader;
import org.apache.avro.io.DecoderFactory;

public class StorageSample {

    /*
     * SimpleRowReader handles deserialization of the Avro-encoded row blocks transmitted
     * from the storage API using a generic datum decoder.
     */
    private static class SimpleRowReader {

        private final DatumReader<GenericRecord> datumReader;

        // Decoder object will be reused to avoid re-allocation and too much garbage collection.
        private BinaryDecoder decoder = null;

        // GenericRecord object will be reused.
        private GenericRecord row = null;

        public SimpleRowReader(Schema schema) {
            Preconditions.checkNotNull(schema);
            datumReader = new GenericDatumReader<>(schema);
        }

        /**
         * Sample method for processing AVRO rows which only validates decoding.
         *
         * @param avroRows object returned from the ReadRowsResponse.
         */
        public void processRows(AvroRows avroRows) throws IOException {
            decoder =
                    DecoderFactory.get()
                            .binaryDecoder(avroRows.getSerializedBinaryRows().toByteArray(), decoder);

            while (!decoder.isEnd()) {
                // Reusing object row
                row = datumReader.read(row, decoder);
                System.out.println(row.toString());
            }
        }
    }

    public static void main(String... args) throws Exception {
        // Sets your Google Cloud Platform project ID.
        // String projectId = "YOUR_PROJECT_ID";
        String projectId = "gcs-test";
        Integer snapshotMillis = null;
//        if (args.length > 1) {
//            snapshotMillis = Integer.parseInt(args[1]);
//        }

        try (BigQueryReadClient client = BigQueryReadClient.create()) {
            String parent = String.format("projects/%s", projectId);

            // Specify the fully qualified name of the table to read from.
            String srcTable =
                    String.format(
                            "projects/%s/datasets/%s/tables/%s",
                            "gcs-test", "testdata", "testtable");

            // We specify the columns to be projected by adding them to the selected fields,
            // and can set a simple row restriction (commented out below) to limit which rows are transmitted.
            TableReadOptions options =
                    TableReadOptions.newBuilder()
                            .addSelectedFields("id")
                            .addSelectedFields("qtr")
                            .addSelectedFields("sales")
                            .addSelectedFields("year")
                            .addSelectedFields("comments")
                            //.setRowRestriction("state = \"WA\"")
                            .build();

            // Start specifying the read session we want created.
            ReadSession.Builder sessionBuilder =
                    ReadSession.newBuilder()
                            .setTable(srcTable)
                            // This API can also deliver data serialized in Apache Arrow format.
                            // This example leverages Apache Avro.
                            .setDataFormat(DataFormat.AVRO)
                            .setReadOptions(options);

            // Optionally specify the snapshot time.  When unspecified, snapshot time is "now".
            if (snapshotMillis != null) {
                Timestamp t =
                        Timestamp.newBuilder()
                                .setSeconds(snapshotMillis / 1000)
                                .setNanos((int) ((snapshotMillis % 1000) * 1000000))
                                .build();
                TableModifiers modifiers = TableModifiers.newBuilder().setSnapshotTime(t).build();
                sessionBuilder.setTableModifiers(modifiers);
            }

            // Begin building the session creation request.
            CreateReadSessionRequest.Builder builder =
                    CreateReadSessionRequest.newBuilder()
                            .setParent(parent)
                            .setReadSession(sessionBuilder)
                            .setMaxStreamCount(1);

            // Request the session creation.
            ReadSession session = client.createReadSession(builder.build());

            SimpleRowReader reader =
                    new SimpleRowReader(new Schema.Parser().parse(session.getAvroSchema().getSchema()));

            // Assert that there are streams available in the session.  An empty table may not have
            // data available.  If no sessions are available for an anonymous (cached) table, consider
            // writing results of a query to a named table rather than consuming cached results directly.
            Preconditions.checkState(session.getStreamsCount() > 0);

            // Use the first stream to perform reading.
            String streamName = session.getStreams(0).getName();

            ReadRowsRequest readRowsRequest =
                    ReadRowsRequest.newBuilder().setReadStream(streamName).build();

            // Process each block of rows as they arrive and decode using our simple row reader.
            ServerStream<ReadRowsResponse> stream = client.readRowsCallable().call(readRowsRequest);
            for (ReadRowsResponse response : stream) {
                Preconditions.checkState(response.hasAvroRows());
                reader.processRows(response.getAvroRows());
            }
        }
    }
}

With the BigQuery Storage Read API, LIMIT is effectively just a case of stopping the row read after you've processed the desired number of elements.
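
For example, reusing the client, readRowsRequest and reader objects from the sample above, a minimal sketch of a LIMIT-style read might look like the following. The rowLimit value is hypothetical, and the cut-off happens at block granularity, so the last processed block may contain a few rows past the limit unless the row reader counts rows as well.

long rowLimit = 1000; // hypothetical LIMIT value for illustration
long rowsRead = 0;

ServerStream<ReadRowsResponse> stream = client.readRowsCallable().call(readRowsRequest);
for (ReadRowsResponse response : stream) {
    // Decode and process the current block of rows.
    reader.processRows(response.getAvroRows());
    rowsRead += response.getRowCount();
    if (rowsRead >= rowLimit) {
        // Stop consuming the stream once the desired number of rows has been processed.
        stream.cancel();
        break;
    }
}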

Applying the notion of an OFFSET clause is a bit more nuanced, as it implies ordering. If you're reading a table via multiple streams for improved throughput, you're either disregarding ordering entirely, or you're re-ordering data after you've read it from the API.

If you read the table as a single stream, you preserve whatever ordering the input table had, and you can set the offset field on the ReadRowsRequest to start reading at a given row offset.
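
As a rough sketch, assuming the read session above is created with setMaxStreamCount(1) so that the single resulting stream preserves the table's order, the offset can be supplied on the ReadRowsRequest like this (the startOffset value is hypothetical):

long startOffset = 500; // skip the first 500 rows of the stream

ReadRowsRequest offsetRequest =
        ReadRowsRequest.newBuilder()
                .setReadStream(streamName) // the single stream returned in the session
                .setOffset(startOffset)    // zero-based row offset within the stream
                .build();

for (ReadRowsResponse response : client.readRowsCallable().call(offsetRequest)) {
    reader.processRows(response.getAvroRows());
}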
