
BigQuery SQL code to pull earliest contact

I have a copy of our Salesforce data in BigQuery, and I'm trying to join the contact table with the account table.

I want to return every account in the dataset, but for each account I only want the contact that was created first.

I've gone around in circles today googling and trying to cobble a query together, but all roads lead to either no accounts, a single account, or loads of contacts per account (ignoring the earliest requirement).

Here's the latest query, which produces no results. I think I'm nearly there but still struggling. Any help would be most appreciated.


SELECT distinct  
 c.accountid as Acct_id 
,a.id as a_Acct_ID
,c.id as Cont_ID
,a.id AS a_CONT_ID 
,c.email
,c.createddate

FROM `sfdcaccounttable` a

INNER JOIN `sfdccontacttable` c
ON c.accountid = a.id

INNER JOIN
    (SELECT a2.id, c2.accountid, c2.createddate AS MINCREATEDDATE
    FROM `sfdccontacttable` c2

    INNER JOIN `sfdcaccounttable` a2 ON a2.id = c2.accountid

 GROUP BY 1,2,3
 ORDER BY c2.createddate asc LIMIT 1) c3 
ON c.id = c3.id

ORDER BY a.id asc

LIMIT 10

The solution shared above is very BigQuery-specific, and it has some quirks you need to work around, like the memory error you got.

I once answered a similar question here with an approach that is more portable and easier to maintain.

Essentially, you need to create a smaller table (even better, make it a view) with the ID and its first transaction. It's similar to what you shared but slightly different, as you need to group ONLY in the topmost query.

It looks something like this:

select 
# contact ids that are first time contacts
b.id as cont_id,
b.accountid

from `sfdccontacttable` as b inner join 
(   select accountid,
    min(createddate) as first_tx_time
    FROM `sfdccontacttable`  
    group by 1) as a on (a.accountid = b.accountid and b.createddate = a.first_tx_time)
group by 1, 2

You need to do it this way because otherwise you can end up with multiple IDs per account (if there are any other dimensions associated with it). This way is also fairly future-proof: you can add more dimensions to the underlying tables without affecting the result, and you can use a WHERE clause in the inner query to define a "valid" contact, and so on. You can then save that as a view and simply reference it in any subquery or join operation.
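To connect this back to the original question (every account, each with its earliest contact), the first-contact subquery can be joined to the account table. This is only a sketch, assuming the table and column names from the question (`sfdcaccounttable.id`, `sfdccontacttable.accountid`, `createddate`, `email`); the name `first_contact` is made up for illustration. A LEFT JOIN is used so that accounts with no contacts still appear, with NULL contact columns:

```sql
#standardSQL
WITH first_contact AS (
  -- earliest contact creation time per account
  SELECT accountid, MIN(createddate) AS first_created
  FROM `sfdccontacttable`
  GROUP BY accountid
)
SELECT
  a.id AS acct_id,
  c.id AS cont_id,
  c.email,
  c.createddate
FROM `sfdcaccounttable` AS a
LEFT JOIN first_contact AS f
  ON f.accountid = a.id
LEFT JOIN `sfdccontacttable` AS c
  ON c.accountid = f.accountid
 AND c.createddate = f.first_created
ORDER BY a.id
```

If two contacts on the same account share exactly the same `createddate`, both rows come back, so an extra tie-breaker may be needed in that case.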

Set up a view/subquery for client_first or client_last as:

SELECT * except(_rank) from (
  select rank() over (partition by accountid order by createddate  ASC) as _rank, 
   * 
   FROM `prj.dataset.sfdccontacttable`  
)  where _rank=1

Basically, it uses a window function to rank the rows within each account and returns the first one: with ASC that is the first client entry, with DESC the last client entry.

You can do the same for accounts as well; then the join between the two is simple, as there will be exactly one record for each entity.
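As a rough sketch of that join, assuming the column names from the question and that `id` is already unique in the account table (so only the contact side needs ranking): note that `RANK()` gives tied rows the same rank, so two contacts with an identical `createddate` would both get rank 1. The variant below swaps in `ROW_NUMBER()` with the contact `id` as a tie-breaker to guarantee at most one contact per account; the alias `_rn` is illustrative.

```sql
#standardSQL
SELECT
  a.id AS acct_id,
  c.id AS cont_id,
  c.email,
  c.createddate
FROM `sfdcaccounttable` AS a
LEFT JOIN (
  SELECT * EXCEPT(_rn)
  FROM (
    SELECT
      *,
      -- ROW_NUMBER never repeats within a partition; id breaks ties on createddate
      ROW_NUMBER() OVER (
        PARTITION BY accountid
        ORDER BY createddate ASC, id ASC
      ) AS _rn
    FROM `sfdccontacttable`
  )
  WHERE _rn = 1
) AS c
  ON c.accountid = a.id
ORDER BY a.id
```

Switching the LEFT JOIN to an INNER JOIN would drop accounts that have no contacts at all.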

UPDATE

You could also try using ARRAY_AGG, which has a smaller memory footprint.

#standardSQL
SELECT e.* FROM (
  SELECT ARRAY_AGG(
    t ORDER BY t.createddate ASC LIMIT 1
  )[OFFSET(0)]  e
  FROM `dataset.sfdccontacttable` t 
  GROUP BY t.accountid 
)
