How to apply LIMIT- and OFFSET-based data selection using the BigQuery Storage Read API?
Below is the sample I am using to read data from a BigQuery table. It fetches the entire table, and I can provide filters based on column values. But I want to apply LIMIT and OFFSET and provide custom SQL for the data fetch/read. Is that possible with the Storage API?
import com.google.api.gax.rpc.ServerStream;
import com.google.cloud.bigquery.storage.v1.AvroRows;
import com.google.cloud.bigquery.storage.v1.BigQueryReadClient;
import com.google.cloud.bigquery.storage.v1.CreateReadSessionRequest;
import com.google.cloud.bigquery.storage.v1.DataFormat;
import com.google.cloud.bigquery.storage.v1.ReadRowsRequest;
import com.google.cloud.bigquery.storage.v1.ReadRowsResponse;
import com.google.cloud.bigquery.storage.v1.ReadSession;
import com.google.cloud.bigquery.storage.v1.ReadSession.TableModifiers;
import com.google.cloud.bigquery.storage.v1.ReadSession.TableReadOptions;
import com.google.common.base.Preconditions;
import com.google.protobuf.Timestamp;
import java.io.IOException;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryDecoder;
import org.apache.avro.io.DatumReader;
import org.apache.avro.io.DecoderFactory;
public class StorageSample {

  /*
   * SimpleRowReader handles deserialization of the Avro-encoded row blocks transmitted
   * from the storage API using a generic datum decoder.
   */
  private static class SimpleRowReader {

    private final DatumReader<GenericRecord> datumReader;

    // Decoder object will be reused to avoid re-allocation and too much garbage collection.
    private BinaryDecoder decoder = null;

    // GenericRecord object will be reused.
    private GenericRecord row = null;

    public SimpleRowReader(Schema schema) {
      Preconditions.checkNotNull(schema);
      datumReader = new GenericDatumReader<>(schema);
    }

    /**
     * Sample method for processing AVRO rows which only validates decoding.
     *
     * @param avroRows object returned from the ReadRowsResponse.
     */
    public void processRows(AvroRows avroRows) throws IOException {
      decoder =
          DecoderFactory.get()
              .binaryDecoder(avroRows.getSerializedBinaryRows().toByteArray(), decoder);

      while (!decoder.isEnd()) {
        // Reusing object row
        row = datumReader.read(row, decoder);
        System.out.println(row.toString());
      }
    }
  }

  public static void main(String... args) throws Exception {
    // Sets your Google Cloud Platform project ID.
    // String projectId = "YOUR_PROJECT_ID";
    String projectId = "gcs-test";
    Integer snapshotMillis = null;
    // if (args.length > 1) {
    //   snapshotMillis = Integer.parseInt(args[1]);
    // }

    try (BigQueryReadClient client = BigQueryReadClient.create()) {
      String parent = String.format("projects/%s", projectId);

      // This example uses baby name data from the public datasets.
      String srcTable =
          String.format(
              "projects/%s/datasets/%s/tables/%s",
              "gcs-test", "testdata", "testtable");

      // We specify the columns to be projected by adding them to the selected fields,
      // and set a simple filter to restrict which rows are transmitted.
      TableReadOptions options =
          TableReadOptions.newBuilder()
              .addSelectedFields("id")
              .addSelectedFields("qtr")
              .addSelectedFields("sales")
              .addSelectedFields("year")
              .addSelectedFields("comments")
              // .setRowRestriction("state = \"WA\"")
              .build();

      // Start specifying the read session we want created.
      ReadSession.Builder sessionBuilder =
          ReadSession.newBuilder()
              .setTable(srcTable)
              // This API can also deliver data serialized in Apache Avro format.
              // This example leverages Apache Avro.
              .setDataFormat(DataFormat.AVRO)
              .setReadOptions(options);

      // Optionally specify the snapshot time. When unspecified, snapshot time is "now".
      if (snapshotMillis != null) {
        Timestamp t =
            Timestamp.newBuilder()
                .setSeconds(snapshotMillis / 1000)
                .setNanos((int) ((snapshotMillis % 1000) * 1000000))
                .build();
        TableModifiers modifiers = TableModifiers.newBuilder().setSnapshotTime(t).build();
        sessionBuilder.setTableModifiers(modifiers);
      }

      // Begin building the session creation request.
      CreateReadSessionRequest.Builder builder =
          CreateReadSessionRequest.newBuilder()
              .setParent(parent)
              .setReadSession(sessionBuilder)
              .setMaxStreamCount(1);

      // Request the session creation.
      ReadSession session = client.createReadSession(builder.build());

      SimpleRowReader reader =
          new SimpleRowReader(new Schema.Parser().parse(session.getAvroSchema().getSchema()));

      // Assert that there are streams available in the session. An empty table may not have
      // data available. If no sessions are available for an anonymous (cached) table, consider
      // writing results of a query to a named table rather than consuming cached results directly.
      Preconditions.checkState(session.getStreamsCount() > 0);

      // Use the first stream to perform reading.
      String streamName = session.getStreams(0).getName();
      ReadRowsRequest readRowsRequest =
          ReadRowsRequest.newBuilder().setReadStream(streamName).build();

      // Process each block of rows as they arrive and decode using our simple row reader.
      ServerStream<ReadRowsResponse> stream = client.readRowsCallable().call(readRowsRequest);
      for (ReadRowsResponse response : stream) {
        Preconditions.checkState(response.hasAvroRows());
        reader.processRows(response.getAvroRows());
      }
    }
  }
}
With the BigQuery Storage Read API, LIMIT is effectively just a matter of stopping row reading after you've processed the desired number of rows.
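For instance, a minimal sketch of that idea, reusing client, streamName, and reader from your sample and assuming a hypothetical limit of 1,000 rows, counts rows as responses arrive and cancels the stream once enough have been processed:

      // Hypothetical limit value; not part of the original sample.
      long limit = 1000;
      long rowsRead = 0;

      ReadRowsRequest limitedRequest =
          ReadRowsRequest.newBuilder().setReadStream(streamName).build();
      ServerStream<ReadRowsResponse> limitedStream =
          client.readRowsCallable().call(limitedRequest);

      for (ReadRowsResponse response : limitedStream) {
        rowsRead += response.getRowCount();
        reader.processRows(response.getAvroRows());
        if (rowsRead >= limit) {
          // Stop pulling further responses once enough rows have been processed.
          limitedStream.cancel();
          break;
        }
      }

Note that rows arrive in blocks, so the final response may carry the counter slightly past the limit; if you need an exact row count you also have to stop decoding partway through that last block.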
Applying the notion of an OFFSET clause is a bit more nuanced, as it implies ordering. If you're reading a table via multiple streams for improved throughput, you're either disregarding ordering entirely, or you're re-ordering data after you've read it from the API.
If you read a table as a single stream, you preserve whatever ordering the input table had, and you can specify the offset field in the ReadRowsRequest to start at a given offset.
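A minimal sketch of that approach, assuming a single-stream session as in your sample (setMaxStreamCount(1)) and a hypothetical offset of 500 rows:

      // Hypothetical offset value; skips the first 500 rows of the single stream.
      long offset = 500;

      ReadRowsRequest offsetRequest =
          ReadRowsRequest.newBuilder()
              .setReadStream(streamName)
              .setOffset(offset)
              .build();

      for (ReadRowsResponse response : client.readRowsCallable().call(offsetRequest)) {
        reader.processRows(response.getAvroRows());
      }

Because the offset is relative to the stream, this only behaves like a table-level OFFSET when the session has a single stream.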