
How to select a value from a table where the value is a dictionary and the dictionary's key is a number, using an SQL query?

So, I have a table from BigQuery public tables (Google Analytics):

SELECT hits.0.productName
from `bigquery-public-data.google_analytics_sample.ga_sessions_*`,
where date between '20160101' and '20161231'

Additional code:

import os
from google.cloud import bigquery

# Point the client library at the service account key file.
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] ='/Users/<UserName>/Desktop/folder/key/<Key_name>.json'
bigquery_client = bigquery.Client()

Error in Jupyter Notebook:

BadRequest                                Traceback (most recent call last)
<ipython-input-31-424833cf8827> in <module>
----> 1 print(bigquery_client.query(
      2 """
      3 SELECT hits.0.productName
      4 from `bigquery-public-data.google_analytics_sample.ga_sessions_*`,
      5 where date between '20160101' and '20161231'

~/opt/anaconda3/lib/python3.8/site-packages/google/cloud/bigquery/job/query.py in to_dataframe(self, bqstorage_client, dtypes, progress_bar_type, create_bqstorage_client, date_as_object, max_results, geography_as_object)
   1563                 :mod:`shapely` library cannot be imported.
   1564         """
-> 1565         query_result = wait_for_query(self, progress_bar_type, max_results=max_results)
   1566         return query_result.to_dataframe(
   1567             bqstorage_client=bqstorage_client,

~/opt/anaconda3/lib/python3.8/site-packages/google/cloud/bigquery/_tqdm_helpers.py in wait_for_query(query_job, progress_bar_type, max_results)
     86     )
     87     if progress_bar is None:
---> 88         return query_job.result(max_results=max_results)
     90     i = 0

~/opt/anaconda3/lib/python3.8/site-packages/google/cloud/bigquery/job/query.py in result(self, page_size, max_results, retry, timeout, start_index, job_retry)
   1370                 do_get_result = job_retry(do_get_result)
-> 1372             do_get_result()
   1374         except exceptions.GoogleAPICallError as exc:

~/opt/anaconda3/lib/python3.8/site-packages/google/api_core/retry.py in retry_wrapped_func(*args, **kwargs)
    281                 self._initial, self._maximum, multiplier=self._multiplier
    282             )
--> 283             return retry_target(
    284                 target,
    285                 self._predicate,

~/opt/anaconda3/lib/python3.8/site-packages/google/api_core/retry.py in retry_target(target, predicate, sleep_generator, deadline, on_error)
    188     for sleep in sleep_generator:
    189         try:
--> 190             return target()
    192         # pylint: disable=broad-except

~/opt/anaconda3/lib/python3.8/site-packages/google/cloud/bigquery/job/query.py in do_get_result()
   1360                     self._job_retry = job_retry
-> 1362                 super(QueryJob, self).result(retry=retry, timeout=timeout)
   1364                 # Since the job could already be "done" (e.g. got a finished job

~/opt/anaconda3/lib/python3.8/site-packages/google/cloud/bigquery/job/base.py in result(self, retry, timeout)
    712         kwargs = {} if retry is DEFAULT_RETRY else {"retry": retry}
--> 713         return super(_AsyncJob, self).result(timeout=timeout, **kwargs)
    715     def cancelled(self):

~/opt/anaconda3/lib/python3.8/site-packages/google/api_core/future/polling.py in result(self, timeout, retry)
    135             # pylint: disable=raising-bad-type
    136             # Pylint doesn't recognize that this is valid in this case.
--> 137             raise self._exception
    139         return self._result

BadRequest: 400 Syntax error: Unexpected keyword WHERE at [4:1]

(job ID: 3c15e031-ee7d-4594-a577-0237f8282695)

                    -----Query Job SQL Follows-----                    

    |    .    |    .    |    .    |    .    |    .    |    .    |
   2:SELECT hits.0.productName
   3:from `bigquery-public-data.google_analytics_sample.ga_sessions_*`,
   4:where date between '20160101' and '20161231'
    |    .    |    .    |    .    |    .    |    .    |    .    |

As seen in the screenshot, I have a hits column whose value is a dictionary, and I need to fetch the inner dictionary value under the '0' key, but the query fails with the error above. What I actually need is the 'productName' values under all of the numeric keys.

One way to solve this is to filter the data you want directly in the query.

Filtering from BigQuery:

First, for a better understanding, take a look at the data schema for the fields that contain product names:

  • The first possible field is hits.item.productName

    • hits is a repeated RECORD
    • item is a RECORD inside hits
    • productName is a string inside hits.item
  • The second possible field is hits.product.v2ProductName

    • product is a repeated RECORD inside hits
    • v2ProductName is a string inside hits.product

To query a repeated RECORD, you have to flatten it, turning it into a table with the expression UNNEST([record]). So, to return all the unique product names from hits.product.v2ProductName, query:

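from google.cloud import bigquery
import pandas as pd  # required by to_dataframe()

client = bigquery.Client()

# Flatten hits, then each hit's product array, and keep the distinct names.
sql = """
SELECT DISTINCT p.v2productname
FROM `bigquery-public-data.google_analytics_sample.ga_sessions_*`,
  UNNEST(hits) AS h,
  UNNEST(h.product) AS p
WHERE date BETWEEN '20160101' AND '20161231'
  AND p.v2productname IS NOT NULL;
"""

v2productname = client.query(sql).to_dataframe()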
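
To make the effect of the two UNNEST calls more concrete, here is a rough pure-Python analogy. It is only a sketch: the `sessions` list is made-up toy data that imitates the nested shape of the table, and `distinct_product_names` is a hypothetical helper, not part of any BigQuery API. Each nested loop plays the role of one UNNEST.

```python
# Toy rows imitating the nested shape of ga_sessions_* (all values are invented).
sessions = [
    {"date": "20160801", "hits": [
        {"product": [{"v2ProductName": "Mug"}, {"v2ProductName": "Pen"}]},
        {"product": []},  # a hit without products contributes no rows
    ]},
    {"date": "20160802", "hits": [
        {"product": [{"v2ProductName": "Mug"}, {"v2ProductName": None}]},
    ]},
    {"date": "20170105", "hits": [
        {"product": [{"v2ProductName": "Hat"}]},  # outside the date range
    ]},
]


def distinct_product_names(rows, start, end):
    """Hypothetical helper mirroring SELECT DISTINCT ... UNNEST ... WHERE."""
    names = set()  # a set gives the DISTINCT behaviour
    for row in rows:
        # WHERE date BETWEEN start AND end (YYYYMMDD strings compare correctly)
        if not (start <= row["date"] <= end):
            continue
        for hit in row["hits"]:            # UNNEST(hits) AS h
            for product in hit["product"]:  # UNNEST(h.product) AS p
                name = product["v2ProductName"]
                if name is not None:        # p.v2productname IS NOT NULL
                    names.add(name)
    return sorted(names)


result = distinct_product_names(sessions, "20160101", "20161231")
print(result)
```

The point of the analogy is that the "numeric keys" seen in the notebook are just positions in a repeated field, so iterating over every position (what UNNEST does) replaces addressing them one by one.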

To use the field hits.item.productName instead, run the following query; note that all of its records are null:

SELECT DISTINCT h.item.productname
FROM `bigquery-public-data.google_analytics_sample.ga_sessions_*`,
  UNNEST(hits) AS h,
  UNNEST(h.product) AS p
WHERE date BETWEEN '20160101' AND '20161231'
  AND h.item.productname IS NOT NULL;
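
As a side note on the original attempt: the message "Unexpected keyword WHERE at [4:1]" points at the trailing comma after the table name, and `hits.0` is not how an array position is addressed in BigQuery standard SQL. If only the first hit is wanted, an array subscript such as `hits[SAFE_OFFSET(0)]` can be used. The sketch below only builds the query text; `first_hit_product_query` is a hypothetical helper name, and the query itself has not been run against the dataset here, so treat it as a starting point.

```python
def first_hit_product_query(start_date, end_date):
    """Hypothetical helper: build a query over the products of the first hit only."""
    for value in (start_date, end_date):
        # The date column stores strings in YYYYMMDD form.
        if not (len(value) == 8 and value.isdigit()):
            raise ValueError(f"expected a YYYYMMDD string, got {value!r}")
    return f"""
SELECT DISTINCT p.v2ProductName
FROM `bigquery-public-data.google_analytics_sample.ga_sessions_*`,
  UNNEST(hits[SAFE_OFFSET(0)].product) AS p
WHERE date BETWEEN '{start_date}' AND '{end_date}'
  AND p.v2ProductName IS NOT NULL
"""


sql = first_hit_product_query("20160101", "20161231")
print(sql)
```

The resulting string can be passed to `client.query(sql)` in the same way as the queries above. SAFE_OFFSET returns NULL instead of raising an error when a session has no hits.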

Filtering from the dataframe:

I tried to process the data in a dataframe instead, but it was not possible: because of the chain of nested records in the dataset, the to_dataframe() function was not able to process it.

In summary:

Try to filter and process as much of the data as possible in BigQuery; it will be faster and more cost-effective.
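
One way to check how much data a query would scan before paying for it is a dry run. The following is a minimal sketch, assuming the google-cloud-bigquery package is installed and credentials are configured; `estimate_bytes` and `to_gib` are hypothetical helper names introduced for illustration.

```python
SQL = """
SELECT DISTINCT p.v2ProductName
FROM `bigquery-public-data.google_analytics_sample.ga_sessions_*`,
  UNNEST(hits) AS h,
  UNNEST(h.product) AS p
WHERE date BETWEEN '20160101' AND '20161231'
  AND p.v2ProductName IS NOT NULL
"""


def estimate_bytes(sql):
    """Hypothetical helper: return the bytes a query would process, without running it."""
    # Imported here so the rest of the sketch loads even without the library installed.
    from google.cloud import bigquery

    client = bigquery.Client()
    # dry_run validates the query and reports its size; no data is read or billed.
    config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
    job = client.query(sql, job_config=config)
    return job.total_bytes_processed


def to_gib(num_bytes):
    """Convert a byte count to GiB for easier reading."""
    return num_bytes / 1024 ** 3


# Usage (requires credentials):
# print(f"{to_gib(estimate_bytes(SQL)):.3f} GiB would be processed")
```

Comparing the dry-run size of a filtered query with that of an unfiltered one gives a quick sense of how much the filtering in BigQuery saves.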
