How can I select values with an SQL query from a table column whose value is a dictionary with numeric keys?

Question

I am querying a table from the BigQuery public datasets (Google Analytics):

print(bigquery_client.query(
"""
SELECT hits.0.productName
from `bigquery-public-data.google_analytics_sample.ga_sessions_*`,
where date between '20160101' and '20161231'
""").to_dataframe())

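Additional code:

import os
from google.cloud import bigquery

# Point the client library at the service account key file.
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = '/Users/<UserName>/Desktop/folder/key/<Key_name>.json'
bigquery_client = bigquery.Client()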

Error in the Jupyter notebook:

BadRequest                                Traceback (most recent call last)
<ipython-input-31-424833cf8827> in <module>
----> 1 print(bigquery_client.query(
      2 """
      3 SELECT hits.0.productName
      4 from `bigquery-public-data.google_analytics_sample.ga_sessions_*`,
      5 where date between '20160101' and '20161231'

~/opt/anaconda3/lib/python3.8/site-packages/google/cloud/bigquery/job/query.py in to_dataframe(self, bqstorage_client, dtypes, progress_bar_type, create_bqstorage_client, date_as_object, max_results, geography_as_object)
   1563                 :mod:`shapely` library cannot be imported.
   1564         """
-> 1565         query_result = wait_for_query(self, progress_bar_type, max_results=max_results)
   1566         return query_result.to_dataframe(
   1567             bqstorage_client=bqstorage_client,

~/opt/anaconda3/lib/python3.8/site-packages/google/cloud/bigquery/_tqdm_helpers.py in wait_for_query(query_job, progress_bar_type, max_results)
     86     )
     87     if progress_bar is None:
---> 88         return query_job.result(max_results=max_results)
     89 
     90     i = 0

~/opt/anaconda3/lib/python3.8/site-packages/google/cloud/bigquery/job/query.py in result(self, page_size, max_results, retry, timeout, start_index, job_retry)
   1370                 do_get_result = job_retry(do_get_result)
   1371 
-> 1372             do_get_result()
   1373 
   1374         except exceptions.GoogleAPICallError as exc:

~/opt/anaconda3/lib/python3.8/site-packages/google/api_core/retry.py in retry_wrapped_func(*args, **kwargs)
    281                 self._initial, self._maximum, multiplier=self._multiplier
    282             )
--> 283             return retry_target(
    284                 target,
    285                 self._predicate,

~/opt/anaconda3/lib/python3.8/site-packages/google/api_core/retry.py in retry_target(target, predicate, sleep_generator, deadline, on_error)
    188     for sleep in sleep_generator:
    189         try:
--> 190             return target()
    191 
    192         # pylint: disable=broad-except

~/opt/anaconda3/lib/python3.8/site-packages/google/cloud/bigquery/job/query.py in do_get_result()
   1360                     self._job_retry = job_retry
   1361 
-> 1362                 super(QueryJob, self).result(retry=retry, timeout=timeout)
   1363 
   1364                 # Since the job could already be "done" (e.g. got a finished job

~/opt/anaconda3/lib/python3.8/site-packages/google/cloud/bigquery/job/base.py in result(self, retry, timeout)
    711 
    712         kwargs = {} if retry is DEFAULT_RETRY else {"retry": retry}
--> 713         return super(_AsyncJob, self).result(timeout=timeout, **kwargs)
    714 
    715     def cancelled(self):

~/opt/anaconda3/lib/python3.8/site-packages/google/api_core/future/polling.py in result(self, timeout, retry)
    135             # pylint: disable=raising-bad-type
    136             # Pylint doesn't recognize that this is valid in this case.
--> 137             raise self._exception
    138 
    139         return self._result

BadRequest: 400 Syntax error: Unexpected keyword WHERE at [4:1]

(job ID: 3c15e031-ee7d-4594-a577-0237f8282695)

                    -----Query Job SQL Follows-----                    

    |    .    |    .    |    .    |    .    |    .    |    .    |
   1:
   2:SELECT hits.0.productName
   3:from `bigquery-public-data.google_analytics_sample.ga_sessions_*`,
   4:where date between '20160101' and '20161231'
    |    .    |    .    |    .    |    .    |    .    |    .    |

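As the screenshot shows, I have a hits column whose value is a dictionary, and I need to get the inner dictionary values from the "0" column, but I get an error. What I actually need is the "productName" value from all of the numbered columns.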

Answer 1

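One workaround is to filter the data you want directly in the query.

Filtering in BigQuery:

First, for a better understanding, look at the schema of the fields that contain the product name:

[Image: table schema]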

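The first candidate field is hits.item.productName:
- hits is a RECORD
- item is a RECORD inside hits
- productName is a STRING inside hits.item

The second candidate field is hits.product.v2ProductName:
- product is a RECORD inside hits
- v2ProductName is a STRING inside hits.product

To query a RECORD you have to "flatten" it, that is, turn it into a table with the expression UNNEST([record]), as described in the BigQuery documentation. So, to return all unique product names from hits.product.v2ProductName, run the following query: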

 
from google.cloud import bigquery
import pandas as pd  # to_dataframe() needs pandas installed

client = bigquery.Client()

# product is nested inside hits, so hits has to be unnested first.
sql = """
SELECT DISTINCT
  p.v2ProductName
FROM
  `bigquery-public-data.google_analytics_sample.ga_sessions_*`,
  UNNEST(hits) AS h,
  UNNEST(h.product) AS p
WHERE
  date BETWEEN '20160101' AND '20161231'
  AND p.v2ProductName IS NOT NULL;
"""
v2productname = client.query(sql).to_dataframe()
print(v2productname)
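To make the flattening step more concrete, here is a minimal pure-Python sketch of what a nested UNNEST over hits and then product amounts to. The `sessions` rows and the product names in them are made up for illustration; only the field names (date, hits, product, v2ProductName) come from the schema discussed above, and the function name is hypothetical.

```python
# Toy rows shaped like ga_sessions: every session holds a list of hits,
# and every hit holds a list of products. All values are invented.
sessions = [
    {
        "date": "20160801",
        "hits": [
            {"product": [{"v2ProductName": "Mug"}, {"v2ProductName": "Pen"}]},
            {"product": []},  # a hit without products yields no rows
        ],
    },
    {
        "date": "20170105",
        "hits": [
            {"product": [{"v2ProductName": "Mug"}, {"v2ProductName": None}]},
        ],
    },
]


def distinct_product_names(rows, start, end):
    """Mimic SELECT DISTINCT over UNNEST(hits) AS h, UNNEST(h.product) AS p."""
    names = set()
    for row in rows:
        # YYYYMMDD strings compare correctly as plain strings,
        # which is what BETWEEN does on the string-typed date column.
        if not (start <= row["date"] <= end):
            continue
        for hit in row["hits"]:              # UNNEST(hits) AS h
            for product in hit["product"]:   # UNNEST(h.product) AS p
                name = product["v2ProductName"]
                if name is not None:         # IS NOT NULL
                    names.add(name)
    return sorted(names)


print(distinct_product_names(sessions, "20160101", "20161231"))
```

Each nested loop corresponds to one UNNEST in the FROM clause, which is why a hit with an empty product list contributes nothing to the result.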

To use the field hits.item.productName instead, run the following query; note, however, that all of its records are null:

SELECT DISTINCT
  h.item.productName
FROM
  `bigquery-public-data.google_analytics_sample.ga_sessions_*`,
  UNNEST(hits) AS h,
  UNNEST(h.product) AS p
WHERE
  date BETWEEN '20160101' AND '20161231'
  AND h.item.productName IS NOT NULL;
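The question's `hits.0.productName` suggests picking a single hit by position. Two things go wrong in that query: the comma after the table name is what triggers `Unexpected keyword WHERE at [4:1]`, and `hits.0` is not a valid path, because hits is a repeated RECORD (an array) rather than a dictionary with numeric keys. The sketch below shows one way the intent could be written with array indexing. The helper name `first_hit_product_query` is hypothetical, and it only builds the SQL string; the result would still need to be passed to `client.query(...)`.

```python
def first_hit_product_query(start_date, end_date):
    """Build a query that reads productName from the first hit of each session.

    Hypothetical helper for illustration; it only returns SQL text.
    """
    for value in (start_date, end_date):
        # Accept only YYYYMMDD strings so nothing unexpected is
        # interpolated into the SQL text.
        if not (isinstance(value, str) and len(value) == 8 and value.isdigit()):
            raise ValueError(f"expected a YYYYMMDD string, got {value!r}")

    # SAFE_OFFSET(0) returns NULL instead of raising an error
    # when a session has no hits at all.
    return (
        "SELECT\n"
        "  hits[SAFE_OFFSET(0)].item.productName AS productName\n"
        "FROM\n"
        "  `bigquery-public-data.google_analytics_sample.ga_sessions_*`\n"
        "WHERE\n"
        f"  date BETWEEN '{start_date}' AND '{end_date}'"
    )


print(first_hit_product_query("20160101", "20161231"))
```

Indexing returns one element per session, so it fits the "column 0" reading of the question; getting the names from every hit still calls for UNNEST.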

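Filtering from the DataFrame:

I tried to handle this with a DataFrame, but because of the chain of nested records in the dataset, the to_dataframe() function cannot handle it.

In summary:

Try to filter and process as much of the data as possible in BigQuery; it is faster and more cost-effective.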
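One way to push more of the filtering into BigQuery, sketched here as an assumption rather than as part of the original answer, is to restrict the wildcard table with the `_TABLE_SUFFIX` pseudo-column, so that only the daily tables in the requested range are read. The helper name `build_suffix_query` is hypothetical and it only builds the SQL text.

```python
from datetime import date

TABLE = "bigquery-public-data.google_analytics_sample.ga_sessions_*"


def build_suffix_query(start: date, end: date) -> str:
    """Return SQL limited to the daily tables between start and end."""
    if start > end:
        raise ValueError("start must not be after end")

    # The daily tables are named ga_sessions_YYYYMMDD, so the wildcard
    # suffix is the date formatted as YYYYMMDD.
    first = start.strftime("%Y%m%d")
    last = end.strftime("%Y%m%d")

    return (
        "SELECT DISTINCT\n"
        "  p.v2ProductName\n"
        "FROM\n"
        f"  `{TABLE}`,\n"
        "  UNNEST(hits) AS h,\n"
        "  UNNEST(h.product) AS p\n"
        "WHERE\n"
        f"  _TABLE_SUFFIX BETWEEN '{first}' AND '{last}'\n"
        "  AND p.v2ProductName IS NOT NULL"
    )


print(build_suffix_query(date(2016, 8, 1), date(2016, 12, 31)))
```

Taking `date` objects instead of raw strings keeps the interpolated values well-formed without extra validation.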

Answer 1 is dated 2021-10-07 08:13:58.
