简体   繁体   中英

Flatten an array on bigquery

I'm using BigQuery and I have a table with nested fields which I need to flatten in order to be able to download the table. I found explanation page ( https://cloud.google.com/life-sciences/docs/how-tos/flatten-bigquery-table ) but I don't understand it fully.

Here's the code:

standardSQL

SELECT
      reference_name, start_position, end_position, reference_bases,
      call.name AS call_name
FROM
      `PROJECT_ID.BIGQUERY_DATASET.BIGQUERY_TABLE` AS t,
      t.call AS call

I understand why we would write "call.name AS call_name". But what I don't understand why we write "t.call AS call"? What does the letter "t" stand for here? Why do we write t.call? Why do we need need to create an alias with "AS call"?

Thank you for your help!

I'm baffled. "Flattening" refers to fields that are repeated -- arrays. You use unnest to flatten them. That would be:

SELECT reference_name, start_position, end_position, reference_bases,
       call.name AS call_name
FROM `PROJECT_ID.BIGQUERY_DATASET.BIGQUERY_TABLE` t CROSS JOIN
      UNNEST(t.calls) AS call

This says that:

  • calls is a repeated field in your table.
  • calls is a struct.
  • calls has a column called name .

I renamed the call column calls because it contains multiple values -- and to distinguish it from call in the FROM clause.

The name column is being returned in the outer query.

Here is an example you can run:

with `PROJECT_ID.BIGQUERY_DATASET.BIGQUERY_TABLE` as (
       select 'a' as reference_name, 1 as start_position, 5 as end_position, 'b' as reference_bases,
              array[struct('name1' as name),
                    struct('name2' as name),
                    struct('name3' as name)
                   ] calls                   
      )
SELECT reference_name, start_position, end_position, reference_bases,
       call.name AS call_name
FROM `PROJECT_ID.BIGQUERY_DATASET.BIGQUERY_TABLE` t CROSS JOIN
      UNNEST(t.calls) AS call;

But what I don't understand why we write "t.call AS call"? What does the letter "t" stand for here? Why do we write t.call?

Letter t is the alias for table in above row. In t.call it fully qualifies path to call inner table - call is repeated field so along with t it is treated as a 'virtual' table that you can use in joins - in this case it is CROSS JOIN (as comma is a shortcut to CROSS JOIN)

Why do we need need to create an alias with "AS call"?

In SELECT statement there is a reference to call.name which means that you output name field from that call 'virtual table'. This is important in cases when there is another name column let's say in main table, so you avoid ambiguity error
Note: if there is no any other column named name - you can skip call as in below example

SELECT
      reference_name, start_position, end_position, reference_bases,
      name AS call_name
FROM
      `PROJECT_ID.BIGQUERY_DATASET.BIGQUERY_TABLE` AS t,
      t.call  

Try this link .

You can achieve this using unnest:

SELECT
  reference_name, start_position, end_position, reference_bases,
  call.name AS call_name
FROM
  `PROJECT_ID.BIGQUERY_DATASET.BIGQUERY_TABLE`,
  UNNEST(call) AS call

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM