How to join tables on cases where none of function(a) in b

Say in MonetDB (specifically, the embedded version from the "MonetDBLite" R package) I have a table "events" containing entity ID codes and event start and end dates, of the format:

| id  | start_date  | end_date   |
| 1   | 2010-01-01  | 2010-03-30 |
| 1   | 2010-04-01  | 2010-06-30 |
| 2   | 2018-04-01  | 2018-06-30 |
| ... | ...         | ...        |

The table is approximately 80 million rows of events, attributable to approximately 2.5 million unique entities (ID values). The dates appear to align nicely with calendar quarters, but I haven't thoroughly checked them so assume they can be arbitrary. However, I have at least sense-checked them for end_date > start_date.

I want to produce a table "nonevent_qtrs" listing calendar quarters where an ID has no event recorded, e.g.:

| id  | last_doq   |
| 1   | 2010-09-30 |
| 1   | 2010-12-31 |
| ... | ...        |
| 1   | 2018-06-30 |
| 2   | 2010-03-30 |
| ... | ...        |

(doq = day of quarter)

If the extent of an event spans any days of the quarter (including the first and last dates), then I want it to count as having occurred in that quarter.

To help with this, I have produced a "calendar table": a table of quarters "qtrs", covering the entire span of dates present in "events", and of the format:

| first_doq  | last_doq   |
| 2010-01-01 | 2010-03-30 |
| 2010-04-01 | 2010-06-30 |
| ...        | ...        |

And tried using a non-equi join like so:

create table nonevents
as select
    id,
    last_doq
from
    events
    full outer join
    qtrs
on
    start_date > last_doq or
    end_date < first_doq
group by
    id,
    last_doq

But this is a) terribly inefficient and b) certainly wrong, since most IDs are listed as being non-eventful for all quarters.
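A minimal sketch of why the condition over-reports, run against SQLite through Python's `sqlite3` module purely so it is self-contained (MonetDB is not used here, and the sample rows are hypothetical). Dates are stored as ISO-8601 text, which compares correctly as strings. An inner join stands in for the full outer join, since the unmatched rows are not what causes the problem: the `or` condition pairs each event with every quarter it does not overlap, so an ID with events in two different quarters ends up listed for all quarters.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    create table events (id integer, start_date text, end_date text);
    insert into events values
        (1, '2010-01-01', '2010-03-31'),   -- hypothetical sample rows
        (1, '2010-04-01', '2010-06-30');
    create table qtrs (first_doq text, last_doq text);
    insert into qtrs values
        ('2010-01-01', '2010-03-31'),
        ('2010-04-01', '2010-06-30'),
        ('2010-07-01', '2010-09-30');
""")

# Same join condition as in the question: it matches an (event, quarter)
# pair whenever that one event misses the quarter, regardless of whether
# another event of the same id covers it.
flagged = con.execute("""
    select id, last_doq
    from events join qtrs
      on start_date > last_doq or end_date < first_doq
    group by id, last_doq
    order by id, last_doq
""").fetchall()

print(flagged)
```

Here ID 1 is flagged for all three quarters, although only the third one is genuinely event-free. The condition that is needed is "no event of this ID overlaps the quarter", which is an anti-join rather than a join on non-overlap.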

How can I produce the table "nonevent_qtrs" I described, which contains a list of quarters for which each ID had no events?

If it's relevant, the ultimate use case is to calculate runs of non-events for time-to-event analysis and prediction. It feels like run-length encoding will be required. If there's a more direct approach than what I've described above then I'm all ears. The only reason I'm focusing on non-event runs to begin with is to try to limit the size of the cross product. I've also considered producing something like:

| id  | last_doq   | event |
| 1   | 2010-01-31 | 1     |
| ... | ...        | ...   |
| 1   | 2018-06-30 | 0     |
| ... | ...        | ...   |

But although more useful, this may not be feasible due to the size of the data involved. A wide format:

| id  | 2010-01-31 | ... | 2018-06-30 |
| 1   | 1          | ... | 0          |
| 2   | 0          | ... | 1          |
| ... | ...        | ... | ...        |

would also be handy, but since MonetDB is a column store I'm not sure whether this is more or less efficient.
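The long format with an `event` flag, and the run lengths mentioned above, could be sketched roughly as follows. This is an illustration on hypothetical sample rows, using SQLite via Python's `sqlite3` for self-containment rather than MonetDB, with ISO-8601 text dates. The run-length step is done in Python with `itertools.groupby` here; at 80 million rows it would presumably need to happen inside the database instead.

```python
import sqlite3
from itertools import groupby

con = sqlite3.connect(":memory:")
con.executescript("""
    create table events (id integer, start_date text, end_date text);
    insert into events values
        (1, '2010-01-01', '2010-03-31'),   -- hypothetical sample rows
        (1, '2010-10-01', '2010-12-31');
    create table qtrs (first_doq text, last_doq text);
    insert into qtrs values
        ('2010-01-01', '2010-03-31'),
        ('2010-04-01', '2010-06-30'),
        ('2010-07-01', '2010-09-30'),
        ('2010-10-01', '2010-12-31');
""")

# One row per (id, quarter); event is 1 if any event of that id
# overlaps the quarter (inclusive on both ends), else 0.
rows = con.execute("""
    select i.id, q.last_doq,
           exists (select 1 from events e
                   where e.id = i.id
                     and e.start_date <= q.last_doq
                     and e.end_date >= q.first_doq) as event
    from (select distinct id from events) i cross join qtrs q
    order by i.id, q.last_doq
""").fetchall()

# Run-length encode consecutive quarters with the same flag, per id.
runs = [(id_, flag, len(list(grp)))
        for (id_, flag), grp in groupby(rows, key=lambda r: (r[0], r[2]))]

print(rows)
print(runs)
```

Each tuple in `runs` is `(id, event flag, number of consecutive quarters)`, so a run with flag 0 is a stretch of non-event quarters.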

Let me assume that you have a table of quarters, with the start date of a quarter and the end date. You really need this if you want the quarters that don't exist. After all, how far back in time or forward in time do you want to go?

Then, you can generate all id/quarter combinations and filter out the ones that exist:

select i.id, q.*
from (select distinct id from events) i cross join
     quarters q left join
     events e
     on e.id = i.id and
        e.start_date <= q.quarter_end and
        e.end_date >= q.quarter_start
where e.id is null;
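To see the query above in action, here is a small sketch that runs it on hypothetical sample data. It uses SQLite through Python's `sqlite3` module only so that it is self-contained; it has not been run against MonetDB. The table and column names `quarters`, `quarter_start` and `quarter_end` follow the answer's assumed naming, and dates are ISO-8601 text so that string comparison orders them correctly.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    create table events (id integer, start_date text, end_date text);
    insert into events values
        (1, '2010-01-01', '2010-03-31'),   -- hypothetical sample rows
        (1, '2010-04-01', '2010-06-30'),
        (2, '2010-08-15', '2010-10-10');   -- spans Q3 and Q4
    create table quarters (quarter_start text, quarter_end text);
    insert into quarters values
        ('2010-01-01', '2010-03-31'),
        ('2010-04-01', '2010-06-30'),
        ('2010-07-01', '2010-09-30'),
        ('2010-10-01', '2010-12-31');
""")

# All id/quarter combinations, left-joined to overlapping events;
# rows where no event matched (e.id is null) are the non-event quarters.
nonevent_qtrs = con.execute("""
    select i.id, q.quarter_end
    from (select distinct id from events) i cross join
         quarters q left join
         events e
         on e.id = i.id and
            e.start_date <= q.quarter_end and
            e.end_date >= q.quarter_start
    where e.id is null
    order by i.id, q.quarter_end
""").fetchall()

print(nonevent_qtrs)
```

ID 1 comes back with the third and fourth quarters, and ID 2 with the first and second. The event for ID 2 touches both Q3 and Q4, so neither is reported, which matches the "spans any days of the quarter" rule in the question.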
