So, I have a table from BigQuery public tables (Google Analytics):
print(bigquery_client.query(
"""
SELECT hits.0.productName
from `bigquery-public-data.google_analytics_sample.ga_sessions_*`,
where date between '20160101' and '20161231'
""").to_dataframe())
Additional code:
import os
from google.cloud import bigquery

os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = '/Users/<UserName>/Desktop/folder/key/<Key_name>.json'
bigquery_client = bigquery.Client()
Error in Jupyter Notebook:
BadRequest Traceback (most recent call last)
<ipython-input-31-424833cf8827> in <module>
----> 1 print(bigquery_client.query(
2 """
3 SELECT hits.0.productName
4 from `bigquery-public-data.google_analytics_sample.ga_sessions_*`,
5 where date between '20160101' and '20161231'
~/opt/anaconda3/lib/python3.8/site-packages/google/cloud/bigquery/job/query.py in to_dataframe(self, bqstorage_client, dtypes, progress_bar_type, create_bqstorage_client, date_as_object, max_results, geography_as_object)
1563 :mod:`shapely` library cannot be imported.
1564 """
-> 1565 query_result = wait_for_query(self, progress_bar_type, max_results=max_results)
1566 return query_result.to_dataframe(
1567 bqstorage_client=bqstorage_client,
~/opt/anaconda3/lib/python3.8/site-packages/google/cloud/bigquery/_tqdm_helpers.py in wait_for_query(query_job, progress_bar_type, max_results)
86 )
87 if progress_bar is None:
---> 88 return query_job.result(max_results=max_results)
89
90 i = 0
~/opt/anaconda3/lib/python3.8/site-packages/google/cloud/bigquery/job/query.py in result(self, page_size, max_results, retry, timeout, start_index, job_retry)
1370 do_get_result = job_retry(do_get_result)
1371
-> 1372 do_get_result()
1373
1374 except exceptions.GoogleAPICallError as exc:
~/opt/anaconda3/lib/python3.8/site-packages/google/api_core/retry.py in retry_wrapped_func(*args, **kwargs)
281 self._initial, self._maximum, multiplier=self._multiplier
282 )
--> 283 return retry_target(
284 target,
285 self._predicate,
~/opt/anaconda3/lib/python3.8/site-packages/google/api_core/retry.py in retry_target(target, predicate, sleep_generator, deadline, on_error)
188 for sleep in sleep_generator:
189 try:
--> 190 return target()
191
192 # pylint: disable=broad-except
~/opt/anaconda3/lib/python3.8/site-packages/google/cloud/bigquery/job/query.py in do_get_result()
1360 self._job_retry = job_retry
1361
-> 1362 super(QueryJob, self).result(retry=retry, timeout=timeout)
1363
1364 # Since the job could already be "done" (e.g. got a finished job
~/opt/anaconda3/lib/python3.8/site-packages/google/cloud/bigquery/job/base.py in result(self, retry, timeout)
711
712 kwargs = {} if retry is DEFAULT_RETRY else {"retry": retry}
--> 713 return super(_AsyncJob, self).result(timeout=timeout, **kwargs)
714
715 def cancelled(self):
~/opt/anaconda3/lib/python3.8/site-packages/google/api_core/future/polling.py in result(self, timeout, retry)
135 # pylint: disable=raising-bad-type
136 # Pylint doesn't recognize that this is valid in this case.
--> 137 raise self._exception
138
139 return self._result
BadRequest: 400 Syntax error: Unexpected keyword WHERE at [4:1]
(job ID: 3c15e031-ee7d-4594-a577-0237f8282695)
-----Query Job SQL Follows-----
| . | . | . | . | . | . |
1:
2:SELECT hits.0.productName
3:from `bigquery-public-data.google_analytics_sample.ga_sessions_*`,
4:where date between '20160101' and '20161231'
| . | . | . | . | . | . |
As seen in the screenshot, I have a hits column whose value is a dictionary, and I need to fetch the inner dictionary value from the '0' column, but I get the error above. Actually, I need to take the 'productName' values from all numeric columns.
An approach you can take is to filter the data you want directly in the query.
First, for a better understanding, take a look at the data schema for the fields that contain product names.
The first possible field is hits.item.productName: hits is a RECORD, item is a RECORD inside hits, and productName is a STRING inside hits.item.
The second possible field is hits.product.v2ProductName: product is a RECORD inside hits, and v2ProductName is a STRING inside hits.product.
To query a RECORD, you have to 'flatten' it, turning it into a table with the expression UNNEST([record]), as described in the BigQuery documentation. So, to return all the unique product names from hits.product.v2ProductName, run the following query:
from google.cloud import bigquery
import pandas as pd
client = bigquery.Client()
sql = """
SELECT
  DISTINCT p.v2ProductName
FROM
  `bigquery-public-data.google_analytics_sample.ga_sessions_*`,
  UNNEST(hits) AS h,
  UNNEST(h.product) AS p
WHERE
  date BETWEEN '20160101' AND '20161231'
  AND p.v2ProductName IS NOT NULL;
"""
v2productname = client.query(sql).to_dataframe()
print(v2productname)
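To see what UNNEST is doing without touching BigQuery, you can mimic the same flattening in plain Python. This is just an illustration with hypothetical session rows, not real GA data; the double flattening (sessions → hits → product) mirrors the two UNNEST clauses above:

```python
# Toy rows standing in for GA sessions: hits is a repeated RECORD,
# and product is a repeated RECORD nested inside each hit.
sessions = [
    {"date": "20160101", "hits": [
        {"product": [{"v2ProductName": "Android Mug"},
                     {"v2ProductName": "Google Pen"}]},
        {"product": [{"v2ProductName": "Android Mug"}]},
    ]},
]

# Equivalent of SELECT DISTINCT p.v2ProductName ... UNNEST(hits), UNNEST(h.product):
# each nested level becomes one more loop, and the set gives DISTINCT.
names = sorted({p["v2ProductName"]
                for s in sessions
                for h in s["hits"]
                for p in h["product"]
                if p.get("v2ProductName") is not None})
print(names)  # ['Android Mug', 'Google Pen']
```

Each UNNEST corresponds to one level of the loop; that is why `product` cannot be unnested directly from the table — it only exists inside an already-unnested hit.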
To use the field hits.item.productName, run the following, but note that in this dataset all its records are null:
SELECT
  DISTINCT h.item.productName
FROM
  `bigquery-public-data.google_analytics_sample.ga_sessions_*`,
  UNNEST(hits) AS h
WHERE
  date BETWEEN '20160101' AND '20161231'
  AND h.item.productName IS NOT NULL;
I tried to process it in a dataframe instead, but that is not practical here: because of the chain of nested REPEATED records in this dataset, to_dataframe() cannot flatten it for you.
Try to filter and process as much of the data as possible in BigQuery itself; it will be faster and more cost-effective.
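If you do need to flatten nested rows client-side, pandas can do it once you have the rows as plain dicts (for example, built from `client.query(sql).result()`). A minimal sketch with hypothetical stand-in rows, assuming pandas is installed:

```python
import pandas as pd

# Hypothetical rows shaped like GA session records: a repeated "hits"
# RECORD, each hit holding a repeated "product" RECORD.
rows = [
    {"date": "20160101",
     "hits": [{"product": [{"v2ProductName": "Android Mug"}]},
              {"product": [{"v2ProductName": "Google Pen"},
                           {"v2ProductName": "Android Mug"}]}]},
]

# record_path walks hits -> product, emitting one row per product;
# meta carries the top-level date down to every flattened row.
df = pd.json_normalize(rows, record_path=["hits", "product"], meta=["date"])
print(df)  # 3 rows: v2ProductName + date columns
```

This is a fallback for small result sets; for anything large, the UNNEST queries above remain the cheaper option.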