
How to implement BigQuery logic in pandas

I have a dataframe df holding all the data from the table (select * from table), but how do I implement the logic below?

select
    max(ga_session_id) over (partition by event_timestamp,event_name) as ga_session_id,
    concat(user_pseudo_id,event_timestamp,event_name,dedup_id) as join_key,
    * except(ga_session_id)
from (
    select
        user_pseudo_id,
        case when event_params.key = 'ga_session_id' then event_params.value.int_value else null end as ga_session_id,
        event_timestamp,
        event_name,
        event_params.key,
        event_params.value.string_value,
        event_params.value.int_value,
        event_params.value.float_value,
        event_params.value.double_value,
        dedup_id
    from (
        select
            row_number() over(partition by user_pseudo_id, event_timestamp, event_name) as dedup_id,
            *
        from
            -- change this to your google analytics 4 export location in bigquery
            `ga4bigquery.analytics_250794857.events_*`
        where
            -- define static and/or dynamic start and end date
            _table_suffix between '20201201' and format_date('%Y%m%d',date_sub(current_date(), interval 1 day))),
        unnest(event_params) as event_params)

Understanding your question as "I have all my BQ table's raw data in a dataframe - how do I replicate the logic of this SQL statement in Python?":

The short answer is: this isn't a good idea. BigQuery will be far more efficient at processing this data than Python (we're talking seconds vs. hours once you pass the ~10 million row mark).
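That said, if you do need to mirror this query in pandas, its window functions and UNNEST map onto groupby/explode operations roughly as follows. This is only a sketch on toy data (the dataframe and its values are made up for illustration); the column names follow the GA4 export schema used in the SQL above, and the `* except(...)` projection is left out:

```python
import pandas as pd

# Toy stand-in for the raw export dataframe; in reality this would come
# from client.query(...).to_dataframe(). Column names follow the GA4
# export schema assumed by the SQL above.
df = pd.DataFrame({
    "user_pseudo_id": ["u1", "u1"],
    "event_timestamp": [1607000000, 1607000000],
    "event_name": ["page_view", "page_view"],
    "event_params": [
        [{"key": "ga_session_id", "value": {"int_value": 5}}],
        [{"key": "page_title", "value": {"string_value": "Home"}}],
    ],
})

# row_number() over (partition by user_pseudo_id, event_timestamp, event_name)
df["dedup_id"] = (
    df.groupby(["user_pseudo_id", "event_timestamp", "event_name"]).cumcount() + 1
)

# unnest(event_params): one row per key/value struct
df = df.explode("event_params").reset_index(drop=True)
params = pd.json_normalize(df["event_params"].tolist())  # key, value.int_value, ...
df = pd.concat([df.drop(columns=["event_params"]), params], axis=1)

# case when event_params.key = 'ga_session_id' then ...int_value else null end
df["ga_session_id"] = df["value.int_value"].where(df["key"] == "ga_session_id")

# max(ga_session_id) over (partition by event_timestamp, event_name)
df["ga_session_id"] = (
    df.groupby(["event_timestamp", "event_name"])["ga_session_id"].transform("max")
)

# concat(user_pseudo_id, event_timestamp, event_name, dedup_id)
df["join_key"] = (
    df["user_pseudo_id"].astype(str)
    + df["event_timestamp"].astype(str)
    + df["event_name"].astype(str)
    + df["dedup_id"].astype(str)
)
```

In short: `row_number()` becomes `groupby(...).cumcount() + 1`, `max(...) over (...)` becomes `groupby(...).transform("max")`, and `unnest` becomes `explode` plus `json_normalize`. On a large export this will be dramatically slower than running the SQL in BigQuery.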

Without knowing more about your use case, the best option is probably to use the Python BigQuery client to run your SQL and load the results directly into a dataframe.

Your code would look something like:

from google.cloud import bigquery

# Uses application default credentials; see the link below for other
# authentication options.
client = bigquery.Client()

QUERY = (
    '... Your query '  # paste the SQL above (or any other query) here
    )

# Run the query and load the result set into a pandas dataframe
df = client.query(QUERY).to_dataframe()

See https://pypi.org/project/google-cloud-bigquery/ for documentation, details on how to authenticate, etc.
